1. Summary

Intrusion detection systems (IDSs) are currently developed using pure knowledge-engineering
approaches, in which expert knowledge about networks, operating systems, and attack
methods is encoded as detection models. These IDSs are not very effective at detecting
variations of known attacks or novel attacks, because expert knowledge is often
incomplete and tends to be too specific to particular attack instances. Since the manual
development process is very slow and expensive, IDSs are often equipped with only one
centralized detection module, which leaves them unable to keep up with fast (automated)
attacks and, worse, vulnerable to denial-of-service attacks. Current IDSs are also not
cost-sensitive: cost factors such as development and operational costs and intrusion
costs (damages) are simply ignored as unwanted complexities in the IDS life cycle.

The proposed research aims to develop methodologies and tools for building cost-sensitive
and light intrusion detection models. The main technical components of the research are:

• Automatically constructing features and anomaly detection models by analyzing the
patterns of normal and intrusion activities computed from large amounts of audit data.

• Using cost-sensitive machine learning algorithms to construct intrusion detection models
that achieve optimal performance under the given (often site-specific) cost metrics, and
clustering attack signatures and normal profiles so that one light model can be
constructed for each cluster, maximizing the utility of each model.

• Dynamically (re-)configuring the light models to make an IDS effective, efficient, and
resilient to attacks that target the IDS itself.

We have successfully accomplished the goals of the project. We developed several novel
feature construction and anomaly detection algorithms. In particular, we invented very
light-weight anomaly detection algorithms that analyze the frequent values of packet
header fields or of protocol commands in packet payloads and detect deviations from them
(anomalies). Results on the DARPA IDS Evaluation data and on real-world data showed that
these algorithms can effectively detect new attacks.

We studied the problem of cost-sensitive modeling in intrusion detection. We examined the
cost factors in intrusion detection, namely damage cost, response cost, and operational
cost. We showed how the outcomes of an IDS, i.e., true positives, false positives, true
negatives, and false negatives, affect the total cost incurred. For example, responding
to an intrusion whose response cost is higher than its damage cost costs more than not
responding at all. We developed strategies for an IDS to decide whether (and when) to
“ignore” some intrusions in order to minimize cost.
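
The effect of the “ignore” decision on total cost can be illustrated with a small sketch.
It assumes a deliberately simplified per-event model that is our own, not the project's
exact formulation: every analyzed event incurs an operational cost; a response, once
triggered, costs the response cost and is assumed to prevent all damage; an undetected or
ignored intrusion incurs its full damage cost; and the IDS responds only when the
(estimated) damage is at least the response cost. The class and function names are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    damage: float       # (estimated) damage of the intrusion the event is, or is reported as
    response: float     # cost of responding, e.g., terminating a session
    operational: float  # cost of analyzing the event at all


def should_respond(estimated_damage: float, response_cost: float) -> bool:
    # Responding only pays off if it is no more expensive than the damage.
    return estimated_damage >= response_cost


def event_cost(is_intrusion: bool, alarm: bool, c: Costs) -> float:
    """Cost of one event under a simplified, hypothetical cost model."""
    cost = c.operational
    if alarm and should_respond(c.damage, c.response):
        # True positive: the response is assumed to prevent the damage.
        # False positive: the response cost is simply wasted.
        cost += c.response
    elif is_intrusion:
        # False negative, or a true positive that the IDS chose to ignore.
        cost += c.damage
    return cost


# A cheap probe (damage 20) with an expensive response (50): ignoring it is cheaper.
probe = Costs(damage=20.0, response=50.0, operational=1.0)
print(event_cost(True, True, probe))  # 21.0 (responding would have cost 51.0)
```

Under these assumptions, a detected low-damage intrusion is deliberately left alone,
which is exactly the case described above where responding would cost more than the
damage.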

We studied how to dynamically reconfigure a real-time IDS to provide optimal protection,
and developed a control and optimization approach that decides the optimal IDS
configuration based on resource constraints and on traffic and attack conditions. This
problem is modeled as a knapsack problem. Essentially, an IDS has limited real-time
resources and may not be able to process all packets if the traffic rate and volume are
too high. The solution is for the IDS to enable only the most “valuable” set of tasks, so
that the corresponding traffic data can be analyzed within the resource constraints. We
call such an IDS an Adaptive IDS. We have modified the open-source Snort and Bro IDSs to
make them adaptive. Experiments showed that these IDSs can automatically change their
configurations according to traffic and attack conditions to provide the best value.
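
The task-selection step can be sketched as a 0/1 knapsack over detection tasks. This is a
minimal sketch under stated assumptions: each task has an integer resource cost (for
example, estimated CPU units at the current traffic rate) and a scalar value (for example,
expected cost savings under current attack conditions). The task names and numbers are
hypothetical, and the solver is the textbook dynamic program, not necessarily the one used
in the modified Snort and Bro.

```python
from typing import List, Tuple


def select_tasks(tasks: List[Tuple[str, int, float]], budget: int) -> Tuple[float, List[str]]:
    """Choose the subset of (name, cost, value) tasks with maximal total value
    whose total cost fits within the resource budget (0/1 knapsack DP)."""
    n = len(tasks)
    # best[i][b] = max total value using only the first i tasks with budget b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, cost, value) in enumerate(tasks, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # leave task i disabled
            if cost <= b and best[i - 1][b - cost] + value > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + value  # enable task i
    # Walk back through the table to recover which tasks were enabled.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            name, cost, _ = tasks[i - 1]
            chosen.append(name)
            b -= cost
    return best[n][budget], list(reversed(chosen))


tasks = [("http_analysis", 5, 8.0), ("ftp_analysis", 3, 3.0),
         ("portscan_detector", 2, 4.0), ("smtp_analysis", 4, 5.0)]
print(select_tasks(tasks, budget=9))  # (13.0, ['http_analysis', 'smtp_analysis'])
```

Rerunning the selection whenever the measured traffic rate (and thus each task's cost) or
the attack conditions (and thus each task's value) change is what makes the configuration
adaptive in this sketch.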

In addition to the research tasks outlined in the original proposal, we also studied the
problem of alert correlation. Instead of a pattern-matching approach, which can recognize
only known attack step relationships, we aimed to develop algorithms for detecting new
attack step relationships. We developed a statistical causality analysis approach based
on the Granger Causality Test (GCT), which, with very little prior domain knowledge, can
identify pairs of alerts that correspond to likely related attack steps. The intuition
behind this approach is that related attack steps may result in co-occurrences of their
alerts in the alert data streams, so statistical tools such as GCT can be applied to find
such co-occurrences. Experiments using DARPA’s Grand Challenge Problem (GCP) dataset
showed that this approach can indeed find novel attack step relationships that approaches
based on pattern matching cannot.
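
The core of a Granger causality test can be sketched as a comparison of two
autoregressive models. This minimal sketch assumes the alerts have already been
aggregated into two equal-length time series of counts per time slot; the slot size, the
lag order p, and any significance threshold are assumptions. It returns the standard F
statistic, where a large value suggests that past values of x help predict y, i.e., x
"Granger-causes" y. Least squares is written out in plain Python to keep the sketch
self-contained; it is not the project's implementation.

```python
import random
from typing import List


def _ols_rss(rows: List[List[float]], target: List[float]) -> float:
    """Residual sum of squares of an ordinary least-squares fit, solved via
    the normal equations (X^T X) beta = X^T y with Gaussian elimination."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    v = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda i: abs(a[i][col]))  # partial pivoting
        a[col], a[piv] = a[piv], a[col]
        v[col], v[piv] = v[piv], v[col]
        for i in range(col + 1, k):
            f = a[i][col] / a[col][col]
            for j in range(col, k):
                a[i][j] -= f * a[col][j]
            v[i] -= f * v[col]
    beta = [0.0] * k
    for i in range(k - 1, -1, -1):  # back substitution
        beta[i] = (v[i] - sum(a[i][j] * beta[j] for j in range(i + 1, k))) / a[i][i]
    return sum((t - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, t in zip(rows, target))


def granger_f(x: List[float], y: List[float], p: int = 2) -> float:
    """F statistic for H0: the p lags of x add nothing to predicting y
    beyond y's own p lags. Large values suggest x Granger-causes y."""
    restricted, unrestricted, target = [], [], []
    for t in range(p, len(y)):
        y_lags = [y[t - i] for i in range(1, p + 1)]
        x_lags = [x[t - i] for i in range(1, p + 1)]
        restricted.append([1.0] + y_lags)             # intercept + own lags
        unrestricted.append([1.0] + y_lags + x_lags)  # ... plus lags of x
        target.append(y[t])
    n = len(target)
    rss_r = _ols_rss(restricted, target)
    rss_u = _ols_rss(unrestricted, target)
    return ((rss_r - rss_u) / p) / (rss_u / (n - 2 * p - 1))


if __name__ == "__main__":
    # Hypothetical alert counts per time slot: alert B tends to follow alert A
    # two slots later, so A should "Granger-cause" B but not vice versa.
    rng = random.Random(0)
    a_counts = [float(rng.randint(0, 5)) for _ in range(200)]
    b_counts = [0.0, 0.0] + [a_counts[t - 2] + rng.random() for t in range(2, 200)]
    print(granger_f(a_counts, b_counts), granger_f(b_counts, a_counts))
```

In an alert-correlation setting, the test would be run over pairs of alert types, and
pairs with a high F statistic become candidate attack step relationships for further
analysis.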

The results of this research have been reported in many publications in top conferences
and journals. In addition, we actively engaged in technology transfer throughout the
course of the project. In particular, the PIs were involved in the founding of System
Detection Inc. The company has been developing commercial products based on findings of
this project and of the previous DARPA-funded JAM project.

2. Introduction

Intrusion detection is the process of identifying and responding to malicious actions that
aim to compromise the security of a system, i.e., its confidentiality, integrity, and
availability. The basic premises of intrusion detection are that system activities are
observable, e.g., via auditing, and that normal and intrusion activities leave distinct
evidence. Therefore, an ID model has two basic elements: the features, that is, the
indicators (evidence) measured from the audit data; and the modeling algorithms that piece
together and reason about the indicators. The two main intrusion detection techniques are
misuse detection, which uses signatures of specific attacks or system vulnerabilities to
pattern-match and detect intrusions; and anomaly detection, which uses established normal
profiles of users or system resources to flag significant deviations as probable
intrusions. Misuse detection can be very efficient and accurate; however, by definition,
it can detect only instances of known intrusions. Anomaly detection is the only weapon
against new attacks; however, it often cannot determine the nature of an attack and can
have a high false alarm rate. An IDS therefore needs to carefully combine both misuse and
anomaly detection models.

Despite the research and commercial efforts of the past two decades, there is still a
large gap between the capabilities of IDSs and those of cyber attackers. Results from the
1998 DARPA Intrusion Detection Evaluation showed that although several intrusion detection
programs already achieved good detection rates on known intrusions and their slight
variations, none of the systems achieved an acceptable detection rate on “novel” attacks,
i.e., those that are not modeled in the detection systems. Most IDSs are designed only to
achieve optimal effectiveness (i.e., accuracy). However, for IDSs to be widely deployed,
they need to bring economic benefits to organizations. This requires that IDSs balance
accuracy against costs, which include development costs, operational costs, damage
(intrusion) costs, etc. Real-time IDSs need to avoid becoming a single point of failure,
because cyber attackers are beginning to devise attacks that aim to elude IDSs through
evasion tactics or denial-of-service. Multiple light, fast, and cooperative detection
systems are likely to achieve more robust performance than a monolithic system. Most IDSs
only output alarms on individual attack steps. When IDSs are deployed in a large network,
the sheer number of IDS alerts can overwhelm the security staff and prevent proper and
timely response actions. Therefore, we need to develop techniques to reduce the number of
alerts, correlate the alerts, and recognize complex attack scenarios that are composed of
a number of attack steps.

The traditional manual approach of encoding expert knowledge cannot meet the challenges of
building IDSs with the advanced capabilities discussed above. To effectively detect novel
attacks, an IDS needs to provide comprehensive and systematic coverage, i.e., modeling, of
all network elements and their interactions. Expert knowledge is simply too limited
compared with the complexity of a network system.

The delicate balance between accuracy and various cost factors, and the need to construct
multiple cooperative models, also add significant complexity to the development process.
In alert analysis and attack scenario analysis, the key is to identify the attack steps
that are related. Because there are potentially many possible attack scenarios, it is
impossible to know a priori which attack step relationships are indicative of attack steps
in a scenario.

We therefore need a new development paradigm. We proposed to build and demonstrate a novel
system for rapid development and deployment of effective and cost-sensitive IDSs. The key
motivation of our research is to automate as much as possible of the analysis tasks in
intrusion detection. We consider intrusion detection as a classification problem, that
is, we wish to classify each audit record into one of a discrete set of possible
categories: normal or a particular kind of intrusion. We can thus apply machine learning
approaches to inductively learn classifiers as detection models. Given a set of records
where one of the features is the class label (i.e., the concept), classification
algorithms can compute a model that uses the most discriminating feature values to
describe each concept. However, before we can apply classification algorithms, we first
need to select and construct the right set of system features that may contain evidence
(indicators) of normal activities or intrusions. We developed an automatic feature
selection and construction system to systematically discover and construct predictive
features that can be used to build effective misuse and anomaly detection models. We
developed cost-sensitive classification algorithms to construct ID models that are
optimized to provide the best economic benefits (cost savings). We also studied how to
efficiently execute ID models in real time. In alert analysis, we developed algorithms to
recognize new attack step relationships.

3. Methods, Assumptions, and Procedures

As academic/university researchers, we aimed to make fundamental contributions rather than
to develop product prototypes. We therefore focused on theoretical studies, algorithm
development, and research prototype implementations. As discussed above, a key motivation
is to automate the analysis tasks in intrusion detection. Toward this end, we applied
machine learning, data mining, statistical analysis, and control and optimization
techniques. For example, in alert analysis and attack scenario analysis, rather than
pattern-matching known attack step relationships, which is straightforward but of limited
use against new attack scenarios, we developed a statistical causality analysis approach
that, while still preliminary, shows great potential for recognizing attack step
relationships.

We communicated closely with the program manager and interacted with other research groups
in the community to get feedback on our research directions and progress. In particular,
we used the scenarios and data sets provided by DARPA and other project teams as training
or validation data sets for our algorithms.

The main goal of our project was to build a development system so that effective,
cost-sensitive, and light ID models can be quickly built and deployed. We also developed
real-time IDSs equipped with our ID models to demonstrate the advanced capabilities of our
development system. Our project proceeded as follows.

For the first year, we concentrated on algorithm development. This included enhancing and
integrating existing components of JAM, e.g., the data mining programs for audit data
analysis and the pattern encoding and analysis programs, and developing initial versions
of the new algorithmic components. We established the capabilities of automated feature
construction. We studied the cost factors in intrusion detection and developed a model to
evaluate an IDS based on cost. We also developed a light-weight anomaly detection
algorithm that analyzes the frequent values of packet header fields. Experiments using
the DARPA dataset showed that this algorithm can detect many new attacks.

For the second year, we developed an approach for learning an anomaly detection model over
noisy (unclean) data. We studied the problem of dynamically changing the configuration of
an IDS to provide optimal value according to run-time resource constraints and attack
conditions. We treated it as a control and optimization problem and developed a solution
based on the knapsack problem. We also started to investigate the problem of alert
correlation and attack scenario analysis.

For the third year, we developed a new light-weight anomaly detection algorithm that
analyzes the frequent values of protocol commands in packet payloads. Experiments using
the DARPA dataset showed that this algorithm can detect new attacks. We modified two
open-source IDSs, Bro and Snort, to make them adaptive using our knapsack-based approach.
Experiments showed that these IDSs can dynamically change their configurations to provide
the best detection capabilities according to run-time conditions. We also developed a
statistical causality analysis algorithm, based on the Granger Causality Test (GCT), for
identifying new attack step relationships. Experiments using the DARPA Grand Challenge
Problem (GCP) dataset showed that it can detect new and stealthy attack step relationships
that pattern-matching approaches cannot.

Throughout the project, we attended regular PI meetings and visited other groups to
exchange ideas and share technologies. We published our findings in major conferences each
year so that our work could be critically reviewed by the scientific community. We also
actively participated in technology transfer. In particular, the PIs were part of the
founding team of System Detection Inc., which is developing commercial products based on
technologies developed in this project and in the previous DARPA-funded JAM project.

4. Results and Discussion

We now summarize the main results of the project.

4.1 Feature Construction

Two basic premises of intrusion detection are that system activities are observable,
e.g., via auditing, and that there is distinct evidence that can distinguish normal and
intrusive activities. We call the evidence extracted from raw audit data features, and use
these features for building and evaluating intrusion detection models. Feature extraction
(or construction) is the process of determining which evidence that can be taken from raw
audit data is most useful for analysis. Feature extraction is thus a critical step in
building an IDS: having a set of features whose values in normal audit records differ
significantly from their values in intrusion records is essential for good detection
performance.

We have developed a set of data mining algorithms for selecting and constructing features
from audit data [1]. First, raw (binary) audit data is processed and summarized into
discrete connection records containing a number of basic features, e.g., timestamp,
duration, source and destination IP addresses and ports, and error condition flags.
Specialized data mining programs [2] are then applied to the connection records to compute
frequent patterns describing correlations among features and frequently co-occurring
events across many connection records. The consistent patterns of normal activities and
the "unique" patterns associated with an intrusion are then identified and analyzed to
construct additional features for connection records. It can be shown that the
constructed features can indeed clearly separate intrusion records from normal ones. With
this approach, the constructed features are grounded in empirical data, and are thus more
objective than expert knowledge. Results from the 1998 DARPA Intrusion Detection
Evaluation [3] showed that the ID model constructed using our algorithms was one of the
best performing of all the participating systems.
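
The pattern comparison step can be illustrated with a deliberately simplified sketch. The
programs in [2] mine association rules and frequent episodes; here, as an assumption, we
only count itemsets of size one and two (feature=value pairs that co-occur in a connection
record) and rank the patterns that are frequent in the intrusion data by how much their
support exceeds their support in the normal data. The record fields, the support
threshold, and the "difference" score are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Record = Dict[str, str]
Pattern = Tuple[Tuple[str, str], ...]


def frequent_patterns(records: List[Record], min_support: float) -> Dict[Pattern, float]:
    """Support of every itemset of size 1 or 2 (feature=value pairs that
    co-occur in a record) with support >= min_support."""
    counts: Counter = Counter()
    for rec in records:
        items = sorted(rec.items())
        for size in (1, 2):
            counts.update(combinations(items, size))
    n = len(records)
    return {pat: c / n for pat, c in counts.items() if c / n >= min_support}


def intrusion_only(normal: List[Record], attack: List[Record],
                   min_support: float = 0.3) -> List[Tuple[Pattern, float]]:
    """Rank patterns that are frequent in the attack data by how much their
    support exceeds their support in the normal data."""
    normal_support = frequent_patterns(normal, 0.0)
    attack_frequent = frequent_patterns(attack, min_support)
    scores = {p: s - normal_support.get(p, 0.0) for p, s in attack_frequent.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


normal = [{"service": "http", "flag": "SF"}, {"service": "smtp", "flag": "SF"},
          {"service": "http", "flag": "SF"}]
attack = [{"service": "http", "flag": "S0"}] * 3 + [{"service": "smtp", "flag": "SF"}]
for pattern, score in intrusion_only(normal, attack):
    print(pattern, round(score, 2))
```

In this toy data, patterns involving flag=S0 score highest because they never appear in
the normal records, while service=http scores low because it is common in both sets.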

As an example, let us consider the SYN-Flood attack. When launching this attack, an
attacker uses many spoofed source addresses to open many connections to some port on a
victim host (e.g., http) that never become completely established (i.e., only the first
SYN packet is sent, and the connection remains in the "S0" state). We compared the
patterns from the 1998 DARPA dataset that contain SYN-Flood attacks with the patterns from
a “baseline” normal dataset (of the same network) by first encoding the patterns into
numbers and then computing “difference” scores. The following pattern, a frequent episode
[4], has the highest “intrusion-only” score (i.e., it is unique to the intrusion): “93% of
the time, after two http connections with the S0 flag are made to host victim, within 2
seconds from the first of these two, the third similar connection is made; and this
pattern occurs in 3% of the data”. Accordingly, our feature construction algorithm parses
the pattern and uses the anatomy (or structural) information about the intrusion, e.g.,
"the same service (i.e., port) is targeted", and the invariant information, e.g.,
flag=S0, to construct the following features: “a count of connections to the same
dst_host in the past 2 seconds” and, among these connections, “the percentage of those
that have the same service, and the percentage of those that have the S0 flag.” For the
two “percentage” features, normal connection records have values close to 0, but
connection records belonging to SYN-Flood have values above 80%. Once these
discriminative features are constructed, it is easy to generate the detection rules via
either manual (i.e., hand-coding) or automated (i.e., machine learning) techniques. For
example, we use RIPPER [5], an inductive rule learner, to compute a detection rule for
syn_flood: if, for the past 2 seconds, the count of connections to the same dst_host is
greater than 4, and the percentage of those that have the same service is greater than
75%, and the percentage of those that have the "S0" flag is greater than 75%, then there
is a syn_flood attack.
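
The constructed time-window features and the learned rule can be sketched as follows. The
connection-record fields (timestamp in seconds, dst_host, service, flag) follow the basic
features named above; the function names are hypothetical, records are assumed to be
sorted by timestamp, "the past 2 seconds" is assumed to cover earlier connections only,
and the thresholds are copied from the RIPPER rule in the text rather than learned here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Conn:
    timestamp: float  # seconds
    dst_host: str
    service: str
    flag: str         # e.g. "SF" (normal) or "S0" (only the SYN was seen)


def window_features(conns: List[Conn], i: int, window: float = 2.0) -> Tuple[int, float, float]:
    """For connection i, look at earlier connections to the same dst_host
    within the past `window` seconds and return
    (count, fraction with the same service, fraction with flag S0)."""
    cur = conns[i]
    recent = [c for c in conns[:i]
              if c.dst_host == cur.dst_host and cur.timestamp - c.timestamp <= window]
    if not recent:
        return 0, 0.0, 0.0
    same_srv = sum(c.service == cur.service for c in recent) / len(recent)
    s0 = sum(c.flag == "S0" for c in recent) / len(recent)
    return len(recent), same_srv, s0


def is_syn_flood(count: int, same_srv: float, s0: float) -> bool:
    # The rule quoted in the text: count > 4, same service > 75%, S0 > 75%.
    return count > 4 and same_srv > 0.75 and s0 > 0.75


flood = [Conn(0.1 * k, "victim", "http", "S0") for k in range(10)]
print(window_features(flood, 9), is_syn_flood(*window_features(flood, 9)))
```

Once such features exist, the rule itself is a simple conjunction of thresholds, which is
why the text notes that rules are easy to generate after the discriminative features have
been constructed.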

We have implemented a system that fully automates the feature and model construction
process. The inputs are two sets of connection records, one containing normal connections
and the other containing an attack. The connection records contain the basic features. The
system then computes frequent patterns from both sets of connection records, compares the
patterns to identify the top 10% intrusion-only patterns, parses the patterns to construct
features, and invokes RIPPER to learn rules to detect the intrusion. The learned rules are
tested on a given test dataset. If the accuracy is below a pre-defined threshold, the
above process is iterated, with different heuristics in pattern computation, until a set
of sufficiently accurate rules is computed or a pre-defined limit (e.g., on the number of
iterations) is reached.
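
The iterate-until-accurate control flow can be sketched with the individual steps passed in
as functions. The callables (mine_patterns, construct_features, learn_rules, evaluate),
the list of heuristics, and the default threshold and iteration limit are hypothetical
placeholders for the pattern mining, feature construction, RIPPER training, and testing
steps described above.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def build_model(
    heuristics: Sequence[Any],
    mine_patterns: Callable[[Any], Any],
    construct_features: Callable[[Any], Any],
    learn_rules: Callable[[Any], Any],
    evaluate: Callable[[Any], float],
    threshold: float = 0.9,
    max_iterations: int = 5,
) -> Tuple[Optional[Any], float]:
    """Repeat mine -> construct -> learn -> test, using a different
    pattern-mining heuristic each round; stop once accuracy reaches the
    threshold or the iteration limit is hit. Returns the best rules found."""
    best_rules, best_acc = None, float("-inf")
    for heuristic in list(heuristics)[:max_iterations]:
        patterns = mine_patterns(heuristic)      # top intrusion-only patterns
        features = construct_features(patterns)  # parse patterns into features
        rules = learn_rules(features)            # e.g., train a rule learner
        acc = evaluate(rules)                    # accuracy on the test set
        if acc > best_acc:
            best_rules, best_acc = rules, acc
        if acc >= threshold:
            break
    return best_rules, best_acc


# Toy stand-ins: each "heuristic" is a minimum support value.
accuracy = {"rules@0.1": 0.7, "rules@0.05": 0.95, "rules@0.02": 0.99}
print(build_model([0.1, 0.05, 0.02], lambda h: h, lambda p: p,
                  lambda f: f"rules@{f}", accuracy.__getitem__))
```

Keeping the best rule set seen so far means the loop still returns something useful when
the iteration limit is reached before the threshold is met.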

4.2 Unsupervised Learning

Traditional model building algorithms typically require a large amount of labeled data in
order to create effective detection models. One major difficulty in deploying a data
mining-based IDS is the need to label system audit data for use by these algorithms. For
misuse detection systems, the data needs to be accurately labeled as either normal or
attack. For anomaly detection systems, the data must be verified to ensure it is
completely normal, which requires the same effort. Since models (and data) are specific to
the environment in which the training data was gathered, this labeling cost must be
incurred for each deployment of the system. Ideally, we would like to build detection
models from collected data without needing to label it manually. In this case, the
deployment cost would be greatly decreased. To build such detection models, we need a new
class of model building algorithms that can take unlabeled data as input and create a
detection model. We call these algorithms unsupervised anomaly detection algorithms.

We studied two unsupervised anomaly detection algorithms and applied them to intrusion
detection. These algorithms can also be described as performing anomaly detection over
noisy data. The algorithms must be able to handle noise in the data because we do not want
to manually verify that the collected audit data is absolutely clean (i.e., contains no
intrusions). Unsupervised anomaly detection algorithms are motivated by two major
assumptions about the data, both of which are reasonable for intrusion detection. The
first assumption is that anomalies are very rare. This corresponds to the fact that normal
use of the system greatly outnumbers the occurrence of intrusions, so attacks make up a
relatively small proportion of the total data. The second assumption is that anomalies are
quantitatively different from normal elements. In intrusion detection, this corresponds to
the fact that attacks are drastically different from normal usage.

Since anomalies are very rare and quantitatively different from the normal data, they
stand out as outliers in the data set. Thus, we can cast the problem of detecting attacks
as an outlier detection problem. Outlier detection is the focus of much literature in the
field of statistics [6]. In intrusion detection, intuitively, if the ratio of attacks to
normal data is small enough, then, because the attacks are different, they stand out
against the background of normal data, and we can detect them within the dataset.

We have performed experiments with two types of unsupervised anomaly detection algorithms,
each for a different type of data. We applied a probability-based unsupervised anomaly
detection algorithm to building detection models over system calls, and a clustering-based
unsupervised anomaly detection algorithm to network traffic. The probabilistic algorithm
detects outliers by estimating the likelihood of each element in the data. We partition
the data into two sets, normal elements and anomalous elements. Using a probability
modeling algorithm over the data, we compute the most likely partition of the data.
Details and experimental results of the algorithm applied to system call data are given in
[7]. The clustering approach detects outliers by clustering the data. The intuition is
that the normal data will cluster together because there is a lot of it. Because
anomalous data and normal data are very different from each other, they do not cluster
together. Since there is very little anomalous data relative to the normal data, after
clustering, the anomalous data will be in the small clusters. The algorithm first clusters
the data and then labels the smallest clusters as anomalies. Details and experimental
results of the algorithm applied to network data are given in [8].
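
The clustering approach can be sketched with a simple single-pass, fixed-width clustering.
The distance measure (Euclidean over numeric features that are assumed to be already
normalized), the cluster width w, and the fraction of data allowed in the clusters labeled
anomalous are assumptions for illustration; [8] describes the actual algorithm and
parameter choices.

```python
import math
from typing import List, Sequence


def fixed_width_clusters(points: Sequence[Sequence[float]], w: float) -> List[List[int]]:
    """Single pass: put each point in the first cluster whose center (the
    cluster's first point) is within distance w, else start a new cluster.
    Returns clusters as lists of point indices."""
    centers: List[Sequence[float]] = []
    clusters: List[List[int]] = []
    for i, p in enumerate(points):
        for c, center in enumerate(centers):
            if math.dist(p, center) <= w:
                clusters[c].append(i)
                break
        else:
            centers.append(p)
            clusters.append([i])
    return clusters


def label_anomalies(points: Sequence[Sequence[float]], w: float,
                    small_fraction: float = 0.05) -> List[bool]:
    """Mark the points in the smallest clusters, which together hold at most
    small_fraction of the data, as anomalous."""
    clusters = sorted(fixed_width_clusters(points, w), key=len)
    anomalous = [False] * len(points)
    budget, used = small_fraction * len(points), 0
    for cluster in clusters:
        if used + len(cluster) > budget:
            break
        for i in cluster:
            anomalous[i] = True
        used += len(cluster)
    return anomalous


normal = [(0.01 * k, 0.0) for k in range(40)]  # hypothetical normalized records
outliers = [(5.0, 5.0), (9.0, 1.0)]
flags = label_anomalies(normal + outliers, w=0.5)
print([i for i, is_anom in enumerate(flags) if is_anom])  # [40, 41]
```

Because no labels are used, the same procedure can be run directly on collected traffic
that may contain a small number of attacks, which is the point of the unsupervised setting.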

4.3 Light-Weight Anomaly Detection

Most network anomaly detection systems, such as ADAM [9], NIDES [10], and SPADE [11],
monitor IP addresses, ports, and TCP state. This catches user misbehavior, such as
attempting to access a password-protected service (because the source address is unusual)
or probing a nonexistent service (because the destination address and port are unusual).
However, this misses attacks on public servers or on the TCP/IP stack that might otherwise
be detected because of anomalies in other parts of the protocol. Often these anomalies
occur because of software errors in the attacking or victim program, because of anomalous
output after a successful attack, or because of misguided attempts to elude the IDS. Our
anomaly detection system has two nonstationary components, developed and tested on the
1999 DARPA IDS evaluation test set [12], which simulates a local network under attack. The
first component is a packet header anomaly detector (PHAD), which monitors the entire data
link, network, and transport layers without any preconceptions about which fields might be
useful. The second component is an application layer anomaly detector (ALAD), which
combines a traditional user model based on TCP connections with a model of text-based
protocols such as HTTP, FTP, and SMTP. Both components learn which attributes are useful
for anomaly detection and then use a nonstationary model, in which events receive higher
scores if no novel values have been seen for a long time.

4.3.1 LEARNING NONSTATIONARY MODELS

The goal of intrusion detection is, for any given event x, to assign odds that x is
hostile, i.e., odds(x_is_hostile) = P(attack|x) / P(no_attack|x).

By Bayes' law, we can write:

P(attack|x) = P(x|attack)P(attack) / P(x)

P(no_attack|x) = P(x|no_attack)P(no_attack) / P(x)

By dividing these equations, and letting odds(attack) = P(attack) / P(no_attack), we have:

odds(x_is_hostile) = odds(attack)P(x|attack) / P(x|no_attack)
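
A quick numeric check of this factorization may help. The sketch below uses made-up
probabilities, not measured ones, and hypothetical function names: it computes the
posterior odds both directly from Bayes' law and as the product of the prior odds, the
signature term P(x|attack), and the anomaly term 1/P(x|no_attack). The two agree, and an
event that is rare in attack-free data (small P(x|no_attack)) raises the odds even when
the signature term is weak.

```python
def posterior_odds(p_attack: float, p_x_given_attack: float,
                   p_x_given_normal: float) -> float:
    """odds(x_is_hostile) = odds(attack) * P(x|attack) / P(x|no_attack)."""
    prior_odds = p_attack / (1.0 - p_attack)
    return prior_odds * p_x_given_attack / p_x_given_normal


def posterior_odds_direct(p_attack: float, p_x_given_attack: float,
                          p_x_given_normal: float) -> float:
    """The same odds computed from P(attack|x) and P(no_attack|x) via Bayes' law."""
    p_x = p_x_given_attack * p_attack + p_x_given_normal * (1.0 - p_attack)
    p_attack_given_x = p_x_given_attack * p_attack / p_x
    p_normal_given_x = p_x_given_normal * (1.0 - p_attack) / p_x
    return p_attack_given_x / p_normal_given_x


# Made-up numbers: prior odds of about 1:999, and x is 100 times likelier under attack.
print(posterior_odds(0.001, 0.01, 0.0001), posterior_odds_direct(0.001, 0.01, 0.0001))
```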

We have factored the intrusion detection problem into three terms: odds(attack), the
background rate of attacks; P(x|attack), a signature detection model; and
1/P(x|no_attack), an anomaly detection model. Here we address only the anomaly detection
component, 1/P(x|no_attack). Thus, we model attack-free data and assign (like SPADE)
anomaly scores inversely proportional to the probability of an event based on this
training. Anomaly detection models like ADAM, NIDES, and SPADE are stationary, in that
P(x) depends on the average rate of x in training and is independent of time. For
example, the probability of observing some particular IP address is estimated by counting
the number of observations in training and dividing by the total number of observations.
However, this may be incorrect. Paxson and Floyd [13] showed that many types of network
processes, such as the rate of a particular type of packet, have self-similar or fractal
behavior. This is a nonstationary model, one in which no sample, no matter how short or
long, can predict the rate of events for any other sample. Instead, they found that events
tend to occur in bursts separated by long gaps on all time scales, from milliseconds to
months. We believe this behavior is due to changes of state in the system, such as
programs being started, users logging in, software and hardware upgrades, and so on. We
can adapt to state changes by exponentially decaying the training counts to favor recent
events, and many models do just that. One problem with this approach is that we have to
choose either a decay rate (half-life) or a maximum count in an ad hoc manner. We avoid
this problem by taking training decay to the extreme and discarding all events (an
attribute having some particular value) before the most recent occurrence. In our model,
the best predictor of an event is the time since it last occurred. If an event x last
occurred t seconds ago, then the probability that x will occur again within one second is
1/t. We do not care about any events prior to the most recent occurrence of x.
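
The contrast between exponentially decayed counts and the time-since-last-occurrence
estimate can be sketched as follows. Both function names are hypothetical. The half_life
argument of the first function is exactly the ad hoc choice the text objects to; the
second function has no parameter and returns the 1/t estimate (capped at 1 for t of one
second or less, an assumption added here so the result is a valid probability).

```python
import math
from typing import List


def decayed_rate(event_times: List[float], now: float, half_life: float) -> float:
    """Exponentially decayed estimate of the event rate (events per second).
    Older occurrences count less; the result depends on the chosen half_life."""
    decay = math.log(2) / half_life
    weight = sum(math.exp(-decay * (now - t)) for t in event_times if t <= now)
    return weight * decay  # for a steady process this approximates its rate


def last_occurrence_probability(event_times: List[float], now: float) -> float:
    """Probability that the event recurs within one second, estimated as 1/t,
    where t is the time since its most recent occurrence. Earlier
    occurrences are ignored, and there is no parameter to tune."""
    t = now - max(e for e in event_times if e <= now)
    return 1.0 if t <= 1.0 else 1.0 / t


times = [0.0, 1.0, 2.0, 100.0]  # hypothetical occurrence times (seconds)
print(decayed_rate(times, 110.0, 10.0), decayed_rate(times, 110.0, 1000.0))
print(last_occurrence_probability(times, 110.0))  # 0.1
```

The two decayed estimates differ substantially depending only on the half-life, whereas
the last-occurrence estimate depends only on the observed data.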

In  an  anomaly  detection  system,  we  are  most  interested  in  those  events  that  have  the 
lowest  probability.  As  a  simplification,  we  assign  anomaly  scores  only  to  those  events 
that  have  never  occurred  in  training,  because  these  are  certainly  the  least  likely.  We  use 
the  PPMC  model  of  novel  events,  which  is  also  used  in  data  compression  [14].  This 
model  states  that  if  an  experiment  is  performed  n  times  and  r  different  outcomes  are 
observed,  then  the  probability  that  the  next  outcome  will  not  be  one  of  these  r  values  is 
approximately  r/n.  Stated  another  way,  the  fraction  of  events  that  were  novel  in  training 
is  r/n,  and  we  expect  that  rate  to  continue.  This  probably  overestimates  the  probability 
that  the  next  outcome  will  be  novel,  since  most  of  the  novel  events  probably  occurred 
early  during  training.  Nevertheless,  we  use  it.  Because  we  have  separate  training  data 
(without  attacks)  and  test  data  (with  attacks),  we  cannot  simply  assign  an  anomaly  score 
of 1/P(x) = n/r. If we did, then a subsequent occurrence of x would receive the same
score,  even  though  we  know  (by  our  nonstationary  argument)  that  a  second  occurrence  is 
very  likely  now.  We  also  cannot  add  it  to  our  model,  because  the  data  is  no  longer  attack- 
free.  Instead,  we  record  the  time  of  the  event,  and  assign  subsequent  occurrences  a  score 
of  t/P(x)  =  tn/r,  where  t  is  the  time  since  the  previous  anomaly.  On  the  first  occurrence  of 
x, t is the time since the last novel observation in training. An IDS monitors a large
number  of  attributes  of  a  message,  each  of  which  can  have  many  possible  outcomes.  For 
each  attribute  with  a  value  never  observed  in  training,  an  anomaly  score  of  tn/r  is 
computed,  and  the  sum  of  these  is  then  assigned  to  the  message.  If  this  sum  exceeds  a 
threshold,  then  an  alarm  is  signaled. 

anomaly score = $\sum_i t_i n_i / r_i$, summed over each attribute i whose value is novel (never observed in training)
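A minimal Python sketch of this scoring rule follows, under stated assumptions: each attribute is scored independently, r is taken as the number of distinct values seen in training (PHAD's range clusters are not modeled here), t is measured per attribute from that attribute's previous anomaly, and the model is never updated with test data. The class and function names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple


class NoveltyScorer:
    """Hypothetical sketch of the t*n/r score for a single attribute."""

    def __init__(self, training: Iterable[Tuple[float, Hashable]]) -> None:
        # training: (timestamp, value) pairs from attack-free data, in time order.
        self.seen: Set[Hashable] = set()
        self.n = 0                  # number of training observations
        self.last_novel_time = 0.0  # time of the last novel value in training
        for ts, value in training:
            self.n += 1
            if value not in self.seen:
                self.seen.add(value)
                self.last_novel_time = ts
        if self.n == 0:
            raise ValueError("training data must not be empty")
        self.r = len(self.seen)     # number of novel (distinct) values in training
        # On the first test anomaly, t is measured from the last novel training value.
        self.last_anomaly_time = self.last_novel_time

    def score(self, ts: float, value: Hashable) -> float:
        """Return t*n/r for a value never seen in training, else 0.

        Test values are not added to `seen`, because test data may contain attacks.
        """
        if value in self.seen:
            return 0.0
        t = ts - self.last_anomaly_time
        self.last_anomaly_time = ts  # the next anomaly's t is measured from here
        return t * self.n / self.r


def message_score(scorers: Dict[str, NoveltyScorer], ts: float,
                  message: Dict[str, Hashable]) -> float:
    """Sum the per-attribute scores; an alarm is raised if this exceeds a threshold."""
    return sum(scorers[attr].score(ts, value) for attr, value in message.items())
```

For instance, an attribute trained on the values a, b, a, a at times 0 through 3 has n = 4 and r = 2; a novel value at time 11 then scores 10 × 4 / 2 = 20, and a repeat of it one second later scores only 2, reflecting the nonstationary assumption that a second occurrence is now likely.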

We  next  describe  two  models,  PHAD  and  ALAD.  In  PHAD  (packet  header  anomaly 
detection),  the  message  is  a  single  network  packet,  and  the  attributes  are  the  fields  of  the 
packet  header.  In  ALAD  (application  layer  anomaly  detection),  the  message  is  an 
incoming  server  TCP  connection.  The  attributes  are  the  application  protocol  keywords, 
opening  and  closing  TCP  flags,  source  address,  and  destination  address  and  port  number. 

4.3.1.1 Packet Header Anomaly Detection (PHAD)

PHAD  monitors  33  fields  from  the  Ethernet,  IP,  and  transport  layer  (TCP,  UDP,  or 
ICMP) packet header. Each field is one to four bytes and follows, as nearly as possible,
the field layout specified in the RFCs (Requests for Comments) that define the
protocols, although we had to combine fields smaller than 8 bits (such as the TCP flags)
or split fields longer than 32 bits (such as the Ethernet addresses). The value of each field
is an integer. Depending on the size of the field, the value could range from 0 to as much as $2^{32} - 1$.
Because  it  is  impractical  to  represent  every  observed  value  from  such  a  large  range,  and 
because  we  wish  to  generalize  over  continuous  values,  we  represent  the  set  of  observed 
values with a set of contiguous ranges or clusters. Each new observed value forms a
cluster by itself. If the number of clusters exceeds a limit, C, then we merge the two
closest ones into a single cluster. For example, if C = 3 and we have {3-5, 8, 10-15, 20},
then  we  merge  the  two  closest  to  form  {3-5,  8-15,  20}.  For  the  purposes  of  anomaly 
detection,  the  number  of  novel  values,  r,  is  the  number  of  times  the  set  of  clusters  is 
updated. 
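The cluster-merging rule can be sketched as follows. This is an illustrative Python version under our own assumptions: clusters are kept as sorted, inclusive integer ranges, "closest" means the smallest gap between adjacent ranges, and r is incremented once per value not already covered by a cluster. The class name is hypothetical.

```python
from typing import List, Tuple


class ClusterSet:
    """Hypothetical sketch of PHAD-style field-value clustering with at most C ranges."""

    def __init__(self, max_clusters: int) -> None:
        self.C = max_clusters
        self.clusters: List[Tuple[int, int]] = []  # sorted, disjoint inclusive [lo, hi] ranges
        self.r = 0  # number of times the cluster set was updated (novel values)

    def contains(self, v: int) -> bool:
        return any(lo <= v <= hi for lo, hi in self.clusters)

    def add(self, v: int) -> None:
        if self.contains(v):
            return  # already covered: not a novel value, no update
        self.r += 1
        self.clusters.append((v, v))  # a new value forms a cluster by itself
        self.clusters.sort()
        if len(self.clusters) > self.C:
            # Merge the two adjacent clusters separated by the smallest gap.
            gaps = [self.clusters[i + 1][0] - self.clusters[i][1]
                    for i in range(len(self.clusters) - 1)]
            i = gaps.index(min(gaps))
            self.clusters[i:i + 2] = [(self.clusters[i][0], self.clusters[i + 1][1])]
```

Reproducing the example in the text: with C = 3 and existing clusters {3-5, 8, 10-15}, adding 20 yields four clusters; the pair with the smallest gap (8 and 10-15) is merged, leaving {3-5, 8-15, 20}. A later value of 9 then falls inside 8-15 and is no longer novel, which is how the clusters generalize over continuous values.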

4.3.1.2 Application Layer Anomaly Detection (ALAD)

The  second  component  of  our  anomaly  detection  model  is  the  application  layer  anomaly 
detector  (ALAD).  Instead  of  assigning  anomaly  scores  to  each  packet,  it  assigns  a  score 
to  an  incoming  server  TCP  connection.  TCP  connections  are  reassembled  from  packets. 
ALAD,  unlike  PHAD,  is  configured  knowing  the  range  of  IP  addresses  it  is  supposed  to 
protect,  and  it  distinguishes  server  ports  (0-1023)  from  client  ports  (1024-65535).  We  do 
this  because  most  attacks  are  initiated  by  the  attacker  (rather  than  by  waiting  for  a 
victim),  and  are  therefore  against  servers  rather  than  clients.  We  tested  a  large  number  of 
attributes  and  their  combinations  that  we  believed  might  make  good  models,  and  settled 
on  five  that  gave  the  best  performance  individually  (high  detection  rate  at  a  fixed  false 
alarm rate) on the DARPA IDS evaluation data set [12]. These are:

1. P(src IP | dest IP), where src IP is the external source address of the client
making  the  request,  and  dest  IP  is  the  local  host  address.  This  differs  from  PHAD 
in  that  the  probability  is  conditional  (a  separate  model  for  each  local  dest  IP),  only 
for  TCP,  and  only  for  server  connections  (destination  port  <  1024).  In  training, 
this  model  learns  the  normal  set  of  clients  or  users  for  each  host.  In  effect,  this 
models  the  set  of  clients  allowed  on  a  restricted  service. 

2. P(src IP | dest IP, dest port). This model is like (1) except that there is a separate
model for each server on each host. It learns the normal set of clients for each
server, which may differ across the servers on a single host.

3.  P(dest  IP,  dest  port).  This  model  learns  the  set  of  local  servers  which  normally 
receive  requests.  It  should  catch  probes  that  attempt  to  access  nonexistent  hosts  or 
services. 

4. P(TCP flags | dest port). This model learns the set of normal TCP flag sequences
for  the  first,  next  to  last,  and  last  packet  of  a  connection.  A  normal  sequence  is 
SYN  (request  to  open),  FIN-ACK  (request  to  close  and  acknowledge  the  previous 
packet),  and  ACK  (acknowledge  the  FIN).  The  model  generalizes  across  hosts, 
but  is  separate  for  each  port  number,  because  the  port  number  usually  indicates 
the  type  of  service  (mail,  web,  FTP,  telnet,  etc.).  An  anomaly  can  result  if  a 
connection  fails  or  is  opened  or  closed  abnormally,  possibly  indicating  an  abuse 
of  a  service. 

5.  P(keyword  |  dest  port).  This  model  examines  the  text  in  the  incoming  request 
from  the  reassembled  TCP  stream  to  learn  the  allowable  set  of  keywords  for  each 
application  layer  protocol.  A  keyword  is  defined  as  the  first  word  on  a  line  of 
input,  i.e.  the  text  between  a  linefeed  and  the  following  space.  ALAD  examines 
only  the  first  1000  bytes,  which  is  sufficient  for  most  requests.  It  also  examines 
only  the  header  part  (ending  with  a  blank  line)  of  SMTP  (mail)  and  HTTP  (web) 
requests,  because  the  header  is  more  rigidly  structured  and  easier  to  model  than 
the body (text of email messages or form uploads). An anomaly indicates the use
of a rarely used feature of the protocol, which is common in many R2L (remote-to-local)
attacks.
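The keyword extraction in item 5 can be sketched as below. This Python version is illustrative only: identifying SMTP and HTTP by destination ports 25 and 80, decoding bytes as Latin-1, and tolerating CRLF line endings are our assumptions, and the function name is hypothetical.

```python
from typing import Set

MAX_BYTES = 1000              # only the first 1000 bytes of a request are examined
HEADER_ONLY_PORTS = {25, 80}  # assumption: SMTP and HTTP identified by dest port


def extract_keywords(request: bytes, dest_port: int) -> Set[str]:
    """Return the first word of each line of a reassembled inbound request."""
    # Latin-1 maps every byte to a character, so decoding never fails.
    text = request[:MAX_BYTES].decode("latin-1")
    keywords: Set[str] = set()
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if line == "" and dest_port in HEADER_ONLY_PORTS:
            break  # a blank line ends the SMTP/HTTP header; ignore the body
        word = line.split(" ", 1)[0]  # text up to the first space
        if word:
            keywords.add(word)
    return keywords
```

For an HTTP request consisting of a `GET / HTTP/1.0` line, `Host:` and `User-Agent:` header lines, and a blank line, the keywords are `GET`, `Host:` and `User-Agent:`; anything in the body is ignored. Note that header keywords keep their trailing colon, since a keyword is everything up to the first space.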

As  with  PHAD,  the  anomaly  score  is  tn/r,  where  r  different  values  were  observed  out  of  n 
training  samples,  and  it  has  been  t  seconds  since  the  last  anomaly  was  observed.  An 
anomaly  occurs  only  if  the  value  has  never  been  observed  in  training. 

4.3.2 EXPERIMENTAL RESULTS

We  evaluated  PHAD  and  ALAD  by  running  them  at  the  same  time  on  the  1999  DARPA 
IDS  evaluation  data  set  and  merging  the  results.  Each  system  was  trained  on  week  3  (7 
days,  attack  free)  and  evaluated  on  the  180  detectable  labeled  attacks  from  weeks  4  and  5. 
To  merge  the  results,  we  set  the  two  thresholds  so  that  equal  numbers  of  alarms  were 
taken  from  both  systems,  and  so  that  there  were  100  total  false  alarms  (10  per  day 
including  the  missing  day)  after  removing  duplicate  alarms.  An  alarm  is  considered  a 
duplicate  if  it  identifies  the  same  IP  address  and  the  same  attack  time  within  60  seconds 
of  a  higher  ranked  alarm  from  either  system.  We  chose  60  seconds  because  DARPA 
criteria allow a detection to be counted if the time is correctly identified within 60
seconds  of  any  portion  of  the  attack  period.  Also,  to  be  consistent  with  DARPA,  we  count 
an  attack  as  detected  if  it  identifies  any  IP  address  involved  in  the  attack  (either  target  or 
attacker).  Multiple  detections  of  the  same  attack  (that  remain  after  removing  duplicates) 
are  counted  only  once,  but  all  false  alarms  are  counted.  In  Table  1  we  show  the  results  of 
this  evaluation.  In  the  column  labeled  det  we  list  the  number  of  attacks  detected  out  of  the 
number  of  detectable  instances,  which  does  not  include  missing  data  (week  4,  day  2)  or 
the three attack types (ntfsdos, selfping, snmpget) that generate no inside traffic. Thus,
only  180  of  the  201  attack  instances  are  listed.  In  the  last  column  of  Table  1,  we  describe 
the  PHAD  and  ALAD  anomalies  that  led  to  the  detection,  prior  to  removing  duplicate 
alarms.  For  PHAD,  the  anomaly  is  the  packet  header  field  that  contributed  most  to  the 
overall  score.  For  ALAD,  each  of  the  anomalous  components  (up  to  5)  is  listed.  Based  on 
these  descriptions,  we  adjusted  the  number  of  detections  (column  det)  to  remove 
simulation  artifacts  and  coincidental  detections,  and  to  add  detections  by  Ethernet  address 
rather  than  IP  address,  which  would  not  otherwise  be  counted  by  DARPA  rules.  The 
latter  case  occurs  for  arppoison,  in  which  PHAD  detects  anomalous  Ethernet  addresses  in 
non-IP  packets.  Arppoison  disrupts  network  traffic  by  sending  spoofed  responses  to  ARP- 
who-has  requests  from  a  compromised  local  host  so  that  IP  addresses  are  not  correctly 
resolved  to  Ethernet  addresses. 
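The duplicate-removal rule can be illustrated with the Python sketch below. It assumes the alarms from both systems have already been merged into a single ranking (rank 0 is highest), which the text does not spell out; the `Alarm` type and function name are hypothetical. Following the wording above literally, an alarm is dropped if it names the same IP address within 60 seconds of any higher-ranked alarm, even one that was itself a duplicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alarm:
    rank: int    # position in the merged PHAD/ALAD ranking (0 = highest)
    time: float  # alarm time in seconds
    ip: str      # IP address identified by the alarm


def remove_duplicates(alarms: List[Alarm], window: float = 60.0) -> List[Alarm]:
    """Drop alarms naming the same IP within `window` seconds of a higher-ranked alarm."""
    ranked = sorted(alarms, key=lambda a: a.rank)
    kept: List[Alarm] = []
    for i, alarm in enumerate(ranked):
        higher = ranked[:i]  # every alarm ranked above this one
        if not any(h.ip == alarm.ip and abs(h.time - alarm.time) <= window
                   for h in higher):
            kept.append(alarm)
    return kept
```

Under this literal reading, alarms for the same host at 100 s, 150 s and 200 s collapse to the first, because each later one falls within 60 s of a higher-ranked alarm, while an alarm for that host at 300 s survives.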

The  two  coincidences  are  mscan  (an  anomalous  Ethernet  address,  overlapping  an 
arppoison  attack),  and  illegalsniffer  (a  TCP  checksum  error).  Illegalsniffer  is  a  probe  by  a 
compromised  local  host  being  used  to  sniff  traffic,  and  is  detectable  only  in  the 
simulation  because  it  makes  reverse  DNS  lookups  to  resolve  sniffed  IP  addresses  to  host 
names.  Because  the  attack  is  prolonged,  and  because  all  of  the  local  hosts  are  victims, 
coincidences  are  likely. 
Table 1. Attacks in the 1999 DARPA IDS data set [12], and the number detected (det) out of the total number in the available data.
Detections are for merged PHAD and ALAD at 100 total false alarms, after removing coincidences and simulation artifacts (TTL field) and
adding detections by Ethernet address (arppoison). Attacks listed do not include the 12 attacks in week 4 day 2 (missing data) or 9 attacks
that leave no evidence in the inside network traffic (selfping, snmpget, and ntfsdos). Hard-to-detect attacks (identified by *) are those types
which were detected no more than half of the time by any of the 18 original participants [12, Table 4]. Attack descriptions are due to [15].


Type  | Attack and description (* = hard to detect)                 | Det    | How detected
------|-------------------------------------------------------------|--------|--------------------------------------------------------------
Probe | illegalsniffer - compromised local host sniffs traffic       | 0/2    | (1 coincidental TCP checksum error)
Probe | ipsweep (clear) - ping random IP addresses                   | 1/4    | 1 Ethernet packet size = 52, (1 TTL = 253)
Probe | *ipsweep (stealthy - slow scan)                              | 0/3    | (2 TTL = 253)
Probe | *ls - DNS zone transfer                                      | 0/2    |
Probe | mscan - test multiple vulnerabilities                        | 1/1    | 1 dest IP/port, flags (1 coincidental Ethernet dest)
Probe | ntinfoscan - test multiple NT vulnerabilities                | 2/3    | 2 HTTP "HEAD", 1 FTP "quit", 1 "user", TCP RST, (2 TTL)
Probe | portsweep (clear) - test multiple ports                      | 1/4    | 1 FIN without ACK, (1 TTL)
Probe | *portsweep (stealthy - slow scan)                            | 2/11   | 2 FIN without ACK, (5 TTL)
Probe | *queso - malformed packets fingerprint OS                    | 3/4    | 2 FIN without ACK (1 TTL)
Probe | *resetscan - probe with RST to hide from IDS                 | 0/1    |
Probe | satan - test multiple vulnerabilities                        | 2/2    | 2 HTTP/1 SMTP "QUIT", finger /W, IP length, src IP, (TTL)
DOS   | apache2 - crash web server with long request                 | 3/3    | 3 source IP, 1 HTTP "x" and flags, TCP options in reply
DOS   | *arppoison - spoofed replies to ARP-who-has                  | 3/5    | 3 Ethernet src/dest address (non-IP packet)
DOS   | back - crash web server with "GET /////..."                  | 0/4    |
DOS   | crashiis - crash NT Webserver                                | 5/7    | 4 source IP address, 1 unclosed TCP connection
DOS   | *dosnuke - URG data to NetBIOS crashes Windows               | 4/4    | 3 URG pointer, 4 flags = UAPF
DOS   | land - identical src/dest addr/ports crashes SunOS           | 0/1    |
DOS   | mailbomb - flood SMTP mail server                            | 3/3    | 3 SMTP lowercase "mail" (1 TTL = 253)
DOS   | neptune - SYN flood crashes TCP/IP stack                     | 0/4    | (2 TTL = 253)
DOS   | pod (ping of death) - oversize IP pkt crashes TCP/IP         | 4/4    | 4 IP fragment pointer
DOS   | processtable - server flood exhausts UNIX processes          | 1/3    | 1 source IP address
DOS   | smurf - reply flood to forged ping to broadcast address      | 1/5    | 1 source IP address (2 TTL)
DOS   | syslogd - crash server with forged unresolvable IP           | 0/4    |
DOS   | *tcpreset - local spoofed RST closes connections             | 1/3    | 1 TCP connection not opened or closed
DOS   | teardrop - IP fragments with gaps crashes TCP/IP stack       | 3/3    | 3 frag ptr
DOS   | udpstorm - echo/chargen loop flood                           | 2/2    | 2 UDP checksum error
DOS   | *warezclient - download illegal files by FTP                 | 1/3    | 1 source IP address
DOS   | warezmaster - upload illegal files by FTP                    | 1/1    | 1 source IP address
R2L   | dict (guess telnet/ftp/pop) - dictionary password guessing   | 3/7    | 2 FTP "user", 1 dest IP/port (POP3), 1 src IP
R2L   | framespoofer - trojan web page                               | 0/1    |
R2L   | ftpwrite - upload "+ +" to .rhosts                           | 0/2    |
R2L   | guest - simple password guessing                             | 0/3    |
R2L   | httptunnel - backdoor disguised as web traffic               | 0/2    |
R2L   | imap - mailbox server buffer overflow                        | 0/2    |
R2L   | named - DNS nameserver buffer overflow                       | 0/3    |
R2L   | *ncftp - FTP server buffer overflow                          | 4/5    | 4 dest IP/port, 1 SMTP "RSET", 3 auth "xxxx,25"
R2L   | *netbus - backdoor disguised as SMTP mail traffic            | 2/3    | 2 source IP address, (3 TTL)
R2L   | *netcat - backdoor disguised as DNS traffic                  | 2/4    | 1 src/dest IP, (1 TTL)
R2L   | phf - exploit bad Apache CGI script                          | 2/3    | 2 source IP, 1 null byte in HTTP header
R2L   | ppmacro - trojan PowerPoint macro in web page                | 1/3    | 1 source IP (and TTL)
R2L   | sendmail - SMTP mail server buffer overflow                  | 2/2    | 2 source IP address, 2 global dest IP, 1 "Sender:"
R2L   | *sshtrojan - fake ssh client steals password                 | 1/3    | 1 source IP address
R2L   | xlock - fake screensaver steals password                     | 0/3    |
R2L   | xsnoop - keystrokes intercepted on open X server             | 0/3    |
U2R   | anypw - NT bug exploit                                       | 0/1    |
U2R   | casesen - NT bug exploit                                     | 2/3    | 2 FTP upload (dest IP/port 20, flags, FTP "PWD"), (1 TTL)
U2R   | eject - UNIX suid root buffer overflow                       | 1/2    | 1 FTP upload (src IP, flags)
U2R   | fdformat - UNIX suid root buffer overflow                    | 2/3    | 2 FTP upload (src IP, flags, FTP "STOR")
U2R   | ffbconfig - UNIX suid root buffer overflow                   | 1/2    | 1 SMTP source IP address (email upload)
U2R   | *loadmodule - UNIX trojan shared library                     | 0/2    |
U2R   | *perl - UNIX bug exploit                                     | 0/4    |
U2R   | ps - UNIX bug exploit                                        | 0/3    |
U2R   | *sechole - NT bug exploit                                    | 1/2    | 1 FTP upload (dest IP/port, flags, FTP "STOR"), (1 TTL)
U2R   | *sqlattack - database app bug, escape to user shell          | 0/2    |
U2R   | xterm - UNIX suid root buffer overflow                       | 1/3    | 1 FTP upload (source IP, dest IP/port)
U2R   | yaga - NT bug exploit                                        | 1/4    | 1 FTP upload (src IP, FTP lowercase "user")
Data  | secret - copy secret files or access unencrypted             | 0/4    |
Total |                                                             | 70/180 | (39%); and 23/65 (35%) of hard-to-detect attacks
There are 25 attacks detected by anomalous TTL values in PHAD, which we believe to
be  simulation  artifacts.  TTL  (time  to  live)  is  an  8-bit  counter  decremented  each  time  an  IP 
packet  is  routed  in  order  to  expire  packets  to  avoid  infinite  routing  loops. 

Although  small  TTL  values  might  be  used  to  elude  an  IDS  by  expiring  the  packet 
between  the  IDS  and  the  target  [16],  this  was  not  the  case  because  the  observed  values 
were  large,  usually  126  or  253.  Such  artifacts  are  unfortunate,  but  probably  inevitable, 
given  the  difficulty  of  simulating  the  Internet.  A  likely  explanation  for  these  artifacts  is 
that the machine used to simulate the attacks was at a different real distance (number of router hops) from the inside
sniffer  than  the  machines  used  to  simulate  the  background  traffic.  We  did  not  count 
attacks  detected  solely  by  TTL.  After  adjusting  the  number  of  detections  in  the  det 
column, we detect 70 of 180 (39%) attacks at 100 false alarms. Among the poorly
detected attacks [12, Table 1], we detect 23 of 77 (30%), or 23 of the 65 such attacks (35%)
that are among the 180 detectable attacks in our data set, almost the same rate as for the well-detected attacks.
This  is  a  good  result  because  an  anomaly  detection  system  such  as  ours  would  not  be 
used  by  itself,  but  rather  in  combination  with  other  systems  such  as  those  in  the  original 
evaluation that use signature detection or host-based techniques. For the
combination to be effective, the sets of attacks detected by the different systems must not
overlap significantly, and our results show such non-overlap. We should also point out that when we developed PHAD and ALAD, we did so with
the  goal  of  improving  the  overall  number  of  detections  rather  than  just  the  poorly 
detected  attacks. 

More  detail  about  algorithms  and  experimental  results  is  described  in  [17]. 

4.4 Cost-Sensitive Modeling

Intrusion  detection  systems  must  maximize  the  realization  of  security  goals  while 
minimizing  costs.  In  this  project,  we  studied  the  problem  of  building  cost-sensitive 
intrusion  detection  models.  We  examined  the  major  cost  factors  associated  with  an  IDS, 
which  include  development  cost,  operational  cost,  damage  cost  due  to  successful 
intrusions,  and  the  cost  of  manual  and  automated  response  to  intrusions.  These  cost 
factors  can  be  qualified  according  to  a  defined  attack  taxonomy  and  site-specific  security 
policies  and  priorities.  We  defined  cost  models  to  formulate  the  total  expected  cost  of  an 
IDS,  and  developed  cost-sensitive  machine  learning  techniques  that  can  produce 
detection models that are optimized for user-defined cost metrics. Empirical experiments
showed  that  our  cost-sensitive  modeling  and  deployment  techniques  are  effective  in 
reducing  the  overall  cost  of  intrusion  detection. 

4.4.1 Cost Factors and Metrics

In  order  to  build  cost-sensitive  ID  models,  we  must  first  understand  the  relevant  cost 
factors  and  the  metrics  used  to  define  them.  Borrowing  ideas  from  the  related  fields  of 
credit  card  and  cellular  phone  fraud  detection,  we  identify  the  following  major  cost 
factors  related  to  intrusion  detection:  damage  cost,  response  cost,  and  operational  cost. 
Damage  cost  (DCost)  characterizes  the  amount  of  damage  to  a  target  resource  by  an 
attack  when  intrusion  detection  is  unavailable  or  ineffective.  Response  cost  (RCost)  is  the 
cost  of  acting  upon  an  alarm  or  log  entry  that  indicates  a  potential  intrusion.  Operational 
cost (OpCost) is the cost of processing the stream of events being monitored by an IDS
and  analyzing  the  activities  using  intrusion  detection  models. 

Cost-sensitive  models  can  only  be  constructed  and  evaluated  when  cost  metrics  are  given. 
The  issues  involved  in  the  measurement  of  cost  factors  have  been  studied  by  the 
computer  risk  analysis  and  security  assessment  communities.  The  literature  suggests  that 
attempts  to  fully  quantify  all  factors  involved  in  cost  modeling  usually  generate 
misleading  results  because  not  all  factors  can  be  reduced  to  discrete  dollars  (or  some 
other  common  unit  of  measurement)  and  probabilities  [18,  19,  20,  21,  22].  It  is 
recommended  that  qualitative  analysis  be  used  to  measure  the  relative  magnitudes  of  cost 
factors.  It  should  also  be  noted  that  cost  metrics  are  often  site-specific  because  each 
organization  has  its  own  security  policies,  information  assets,  and  risk  factors  [23]. 

4.4.1.1 Attack Taxonomy

An  attack  taxonomy  is  essential  in  producing  meaningful  cost  metrics.  The  taxonomy 
groups  intrusions  into  different  types  so  that  cost  measurement  can  be  performed  for 
categories  of  similar  attacks.  Intrusions  can  be  categorized  and  analyzed  from  different 
perspectives.  Lindqvist  and  Jonsson  introduced  the  concept  of  the  dimension  of  an 
intrusion  and  used  several  dimensions  to  classify  intrusions  [24].  The  intrusion  results 
dimension categorizes attacks according to their effects (e.g., whether or not denial-of-service
is accomplished). It can therefore be used to assess the damage cost and response
cost.  The  intrusion  techniques  dimension  categorizes  attacks  based  on  their  methods 
(e.g.,  resource  or  bandwidth  consumption).  It  therefore  affects  the  operational  cost  and 
the  response  cost.  Also,  the  intrusion  target  dimension  categorizes  attacks  according  to 
the  resource  being  targeted  and  affects  both  damage  and  response  costs. 

For  example,  using  the  DARPA  Intrusion  Detection  Evaluation  dataset,  our  attack 
taxonomy  first  categorizes  the  intrusions  occurring  in  the  dataset  into  ROOT,  DOS,  R2L, 
and PROBE, based on their intrusion results. Then, within each of these four categories, the
attacks are further partitioned by the techniques used to execute the intrusion. The
sub-categories are ordered by increasing complexity of the attack method. Attacks of
each sub-category can be further partitioned according to the attack targets. For
simplicity, the intrusion target dimension is not shown.

4.4.1.2  Cost  Factors 

Damage  Cost  There  are  several  factors  that  determine  the  damage  cost  of  an  attack. 
Northcutt  uses  criticality  and  lethality  to  quantify  the  damage  that  may  be  incurred  by 
some  intrusive  behavior.  Criticality  measures  the  importance,  or  value,  of  the  target  of  an 
attack.  This  measure  can  be  evaluated  according  to  a  resource’s  functional  role  in  an 
organization  or  its  relative  cost  of  replacement,  unavailability,  and  disclosure  [21]. 
Similar to Northcutt’s analysis, we assign 5 points for firewalls, routers, or DNS servers,
4 points for mail or Web servers, 2 points for UNIX workstations, and 1 point for
Windows or DOS workstations. Lethality measures the degree of damage that could
potentially be caused by some attack. For example, a more lethal attack that helped an
intruder gain root access would have a higher damage cost than if the attack gave the
intruder  local  user  access.  Other  damage  may  include  the  discovery  of  knowledge  about 
network  infrastructure  or  preventing  the  offering  of  some  critical  service.  For  each  main 
attack  category  in  our  attack  taxonomy,  we  define  a  relative  lethality  scale  and  use  it  as 
the base damage cost, or $\text{base}_D$. When assigning damage cost according to the criticality
of the target, we can use the intrusion target dimension. Using these metrics, we can
define the damage cost of an attack targeted at some resource as $\text{criticality} \times \text{base}_D$. For
example, a DOS attack targeted at a firewall has DCost=150, while the same attack
targeted at a Unix workstation has DCost=60 (both values imply $\text{base}_D = 30$ for this DOS attack). In addition to criticality and lethality, we
define the progress of an attack to be a measure of how successful an attack is in
achieving  its  goals.  For  example,  a  Denial-of-Service  (DOS)  attack  via  resource  or 
bandwidth  consumption  (e.g.  SYN  flooding)  may  not  incur  damage  cost  until  it  has 
progressed  to  the  point  where  the  performance  of  the  resource  under  attack  is  starting  to 
suffer.  The  progress  measure  can  be  used  as  an  estimate  of  the  percentage  of  the 
maximum damage cost that should be accounted for. That is, the actual cost is $\text{progress} \times \text{criticality} \times \text{base}_D$.
However, in deciding whether or not to respond to an attack, it is
necessary  to  compare  the  maximum  possible  damage  cost  with  the  response  cost.  This 
requires  that  we  assume  a  worst-case  scenario  in  which  progress  =  1.0. 
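The damage-cost formula can be written as a small Python function. The criticality points are those listed above; the DOS base damage cost of 30 used in the usage example is inferred from the quoted values (150 / 5 = 60 / 2 = 30) rather than stated, and the dictionary keys and function name are hypothetical.

```python
# Criticality points from the text (Northcutt-style); keys are hypothetical labels.
CRITICALITY = {
    "firewall": 5, "router": 5, "dns_server": 5,
    "mail_server": 4, "web_server": 4,
    "unix_workstation": 2,
    "windows_workstation": 1, "dos_workstation": 1,
}


def damage_cost(base_d: float, target: str, progress: float = 1.0) -> float:
    """DCost = progress * criticality * base_D.

    progress defaults to 1.0, the worst case used when deciding whether to respond.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return progress * CRITICALITY[target] * base_d
```

Here `damage_cost(30, "firewall")` gives 150 and `damage_cost(30, "unix_workstation")` gives 60, matching the example; a SYN flood that has reached half its effect on the firewall (`progress=0.5`) would be charged 75.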

Response  Cost  Response  cost  depends  primarily  on  the  type  of  response  mechanisms 
being  used.  This  is  usually  determined  by  an  IDS’s  capabilities,  site-specific  policies, 
attack  type,  and  the  target  resource  [25].  Responses  may  be  either  automated  or  manual, 
and  manual  responses  will  clearly  have  a  higher  response  cost.  Responses  to  intrusions 
that  may  be  automated  include  the  following:  termination  of  the  offending  connection  or 
session  (either  killing  a  process  or  resetting  a  network  connection),  rebooting  the  targeted 
system,  recording  the  session  for  evidence  gathering  purposes  and  further  investigation, 
or  implementation  of  a  packet-filtering  rule  [26,  23].  In  addition  to  these  responses,  a 
notification  may  be  sent  to  the  administrator  of  the  offending  machine  via  e-mail  in  case 
that  machine  was  itself  compromised.  A  more  advanced  response  which  has  not  been 
successfully  employed  to  date  could  involve  the  coordination  of  response  mechanisms  in 
disparate  locations  to  halt  intrusive  behavior  closer  to  its  source.  Additional  manual 
responses  to  an  intrusion  may  involve  further  investigation  (perhaps  to  eliminate  action 
against  false  positives),  identification,  containment,  eradication,  and  recovery  [23].  The 
cost  of  manual  response  includes  the  labor  cost  of  the  response  team,  the  user  of  the 
target,  and  any  other  personnel  that  participate  in  response.  It  also  includes  any  downtime 
needed  for  repairing  and  patching  the  targeted  system  to  prevent  future  damage.  We 
estimate  the  relative  complexities  of  typical  responses  to  each  attack  type  in  Table  1  in 
order to define the relative base response cost, or $\text{base}_R$. Again, we can take into account
the criticality of the attack target when measuring response cost. That is, the cost is
$\text{criticality} \times \text{base}_R$. In addition, attacks using simpler techniques generally have lower
response  costs  than  more  complex  attacks,  which  require  more  complex  mechanisms  for 
effective  response. 
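Combining the damage and response cost formulas gives a simple response rule: respond when the worst-case damage cost (progress = 1.0) is at least the response cost. The Python sketch below is our illustration; the base response costs passed in are hypothetical, since the text gives no numeric values for the base response cost. Because both costs are scaled by the same criticality, under these formulas the decision reduces to comparing the two base costs, and criticality only changes the magnitude of the costs at stake.

```python
# Criticality points from the text; keys are hypothetical labels.
CRITICALITY = {
    "firewall": 5, "router": 5, "dns_server": 5,
    "mail_server": 4, "web_server": 4,
    "unix_workstation": 2,
    "windows_workstation": 1, "dos_workstation": 1,
}


def response_cost(base_r: float, target: str) -> float:
    """RCost = criticality * base_R."""
    return CRITICALITY[target] * base_r


def should_respond(base_d: float, base_r: float, target: str) -> bool:
    """Respond only if worst-case damage (progress = 1.0) is at least RCost."""
    worst_case_dcost = 1.0 * CRITICALITY[target] * base_d
    return worst_case_dcost >= response_cost(base_r, target)
```

With a hypothetical base response cost of 15, a DOS attack (base damage cost 30) on a firewall is worth responding to (150 versus 75), whereas an attack with base damage cost 10 and base response cost 20 on a UNIX workstation is cheaper to ignore (20 versus 40).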

Operational  Cost  The  main  cost  inherent  in  the  operation  of  an  IDS  is  the  amount  of 
time  and  computing  resources  needed  to  extract  and  test  features  from  the  raw  data 
stream that is being monitored. We associate OpCost with time because a real-time IDS
must detect an attack while it is in progress and generate an alarm as quickly as possible
so  that  damage  can  be  minimized.  A  slower  IDS  which  uses  features  with  higher 
computational  costs  should  therefore  be  penalized.  Even  if  a  computing  resource  has  a 
“sunk cost” (e.g., a dedicated IDS box has been purchased in a single payment), we
still  assign  some  cost  to  the  expenditure  of  its  resources  as  they  are  used.  If  a  resource  is 
used  by  one  task,  it  may  not  be  used  by  another  task  at  the  same  time.  The  cost  of 
computing  resources  is  therefore  an  important  factor  in  prioritization  and  decision 
making.  Some  features  cost  more  to  gather  than  others.  However,  costlier  features  are 
often  more  informative  for  detecting  intrusions.  For  example,  features  that  examine 
events  across  a  larger  time  window  have  more  information  available  and  are  often  used 
for  “correlation  analysis”  in  order  to  detect  extended  or  coordinated  attacks  such  as  slow 
host  or  network  scans.  Computation  of  these  features  is  costly  because  of  their  need  to 
store  and  analyze  larger  amounts  of  data.  Based  on  our  extensive  experience  in  extracting 
and  constructing  predictive  features  from  network  audit  data,  we  classify  features  into 
four  relative  levels,  based  on  their  computational  costs: 

•  Level 1 features can be computed from the first packet, e.g., the service.

•  Level  2  features  can  be  computed  at  any  point  during  the  life  of  the  connection, 
e.g., the connection state (SYN WAIT, CONNECTED, FIN WAIT, etc.).

•  Level  3  features  can  be  computed  at  the  end  of  the  connection,  using  only 
information  about  the  connection  being  examined,  e.g.,  the  total  number  of  bytes 
sent  from  source  to  destination. 

•  Level  4  features  can  be  computed  at  the  end  of  the  connection,  but  require  access 
to  data  of  potentially  many  other  prior  connections.  These  are  the  temporal  and 
statistical  features  and  are  the  most  costly  to  compute.  The  computation  of  these 
features may require values of the lower level (i.e., levels 1, 2, and 3) features.

We  can  assign  relative  magnitudes  to  these  features  according  to  their  computational 
costs. For example, level 1 features may cost 1, level 2 features may cost 5, level 3
features  may  cost  10,  and  level  4  features  may  cost  100.  These  estimations  have  been 
verified  empirically  using  a  prototype  system  for  evaluating  our  ID  models  in  real-time 
that  has  been  built  in  coordination  with  Network  Flight  Recorder  [27]. 
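The cost levels above can be turned into an operational-cost estimate for a detection model. The Python sketch below is an illustration under stated assumptions, not the prototype's code: it assumes each distinct feature a model uses is computed once and charged its level's cost (the example magnitudes 1, 5, 10, 100), and the feature names and their level assignments are hypothetical.

```python
# Relative computational cost of each feature level (example magnitudes from the text).
LEVEL_COST = {1: 1, 2: 5, 3: 10, 4: 100}

# Hypothetical feature-to-level assignment, for illustration only.
FEATURE_LEVEL = {
    "service": 1,          # known from the first packet
    "conn_state": 2,       # available at any point during the connection
    "src_bytes": 3,        # known at the end of the connection
    "same_host_count": 4,  # temporal/statistical, needs prior connections
}


def op_cost(features):
    """Operational cost of computing a set of features.

    Assumption: each distinct feature is computed once and charged its
    level's cost, and the costs of different features simply add up.
    """
    return sum(LEVEL_COST[FEATURE_LEVEL[f]] for f in set(features))


print(op_cost(["service", "conn_state"]))       # 6
print(op_cost(["service", "same_host_count"]))  # 101
```

A more careful version would also charge any lower-level features that a level 4 feature depends on, since the text notes that such features may need them.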

4.4.2 Cost Models

A  cost  model  formulates  the  total  expected  cost  of  intrusion  detection.  It  considers  the 
trade-off  among  all  relevant  cost  factors  and  provides  the  basis  for  making  appropriate 
cost-sensitive  detection  decisions.  We  first  examine  the  cost  trade-off  associated  with 
each  possible  outcome  of  observing  some  event  e,  which  may  represent  a  network 
connection,  a  user’s  session  on  a  system,  or  some  logical  grouping  of  activities  being 
monitored.  In  our  discussion,  we  say  that  e=(a,p,r)  is  an  event  described  by  the  attack 
type  a  (which  can  be  normal  for  a  truly  normal  event),  the  progress  p  of  the  attack,  and 
the  target  resource  r.  The  detection  outcome  of  e  is  one  of  the  following:  false  negative 
(FN),  false  positive  (FP),  true  positive  (TP),  true  negative  (TN),  or  misclassified  hit.  The 
costs  associated  with  these  outcomes  are  known  as  consequential  costs  (CCost),  as  they 
are  incurred  as  a  consequence  of  prediction,  and  are  outlined  in  Table  2. 




FN  Cost  is  the  cost  of  not  detecting  an  attack,  and  is  always  incurred  by  systems  that  do 
not  install  IDSs.  When  an  IDS  falsely  decides  that  a  connection  is  not  an  attack  and  does 
not  respond  to  the  attack,  the  attack  will  succeed,  and  the  target  resource  will  be 
damaged.  The  FN  Cost  is  therefore  defined  as  the  damage  cost  associated  with  event  e,  or 
DCost(e). 

TP  Cost  is  incurred  in  the  event  of  a  correctly  classified  attack,  and  involves  the  cost  of 
detecting  the  attack  and  possibly  responding  to  it.  To  determine  whether  response  will  be 
taken,  RCost  and  DCost  must  be  considered.  If  the  damage  done  by  the  attack  to  resource 
r  is  less  than  RCost,  then  ignoring  the  attack  actually  reduces  the  overall  cost.  Therefore, 
if  RCost(e)  >  DCost(e),  the  intrusion  is  not  responded  to  beyond  simply  logging  its 
occurrence,  and  the  loss  is  DCost(e).  Otherwise,  the  intrusion  is  acted  upon  and  the  loss  is 
limited to RCost(e). In reality, however, by the time an attack is detected and response
ensues, some damage may already have been incurred. To account for this, TP cost may be
defined as RCost(e) + ε₁·DCost(e), where ε₁ ∈ [0,1] is a function of the progress p of the attack.

FP Cost is incurred when an event is incorrectly classified as an attack, i.e., when
e=(normal,p,r) is misidentified as e′=(a,p′,r) for some attack a. If RCost(e′) < DCost(e′), a
response will ensue and the response cost, RCost(e′), must be accounted for as well.

Also,  since  normal  activities  may  be  disrupted  due  to  unnecessary  response,  false  alarms 
should  be  penalized.  For  our  discussion,  we  use  PCost(e)  to  represent  the  penalty  cost  of 
treating a legitimate event e as an intrusion. For example, if e is aborted, PCost(e) can be
the damage cost of a DoS attack on resource r, because a legitimate user may be denied
access to r.

TN Cost is always 0: it is incurred when an IDS correctly decides that an event is
normal, so we bear no cost that depends on the outcome of the decision.

Misclassified Hit Cost is incurred when the wrong type of attack is identified, i.e., an
event e=(a,p,r) is misidentified as e′=(a′,p′,r). If RCost(e′) < DCost(e′), a response will
ensue and RCost(e′) needs to be accounted for. Since the response taken is effective
against attack type a′ rather than a, some damage cost of ε₂·DCost(e), where ε₂ ∈ [0,1], will
be incurred due to the true attack.

We  can  now  define  the  cost  model  for  an  IDS.  When  evaluating  an  IDS  over  some 
labeled test set E, where each event e ∈ E has a label of normal or one of the intrusions,
we define the cumulative cost of the IDS as follows:

$$\mathrm{CumulativeCost}(E) = \sum_{e \in E} \big(\mathrm{OpCost}(e) + \mathrm{CCost}(e)\big)$$

where  CCost(e),  the  consequential  cost  of  the  prediction  by  the  IDS  on  e,  is  defined  in 
Table  2. 




Table 2: Model for Consequential Cost

Outcome             Consequential Cost CCost(e)   Condition
------------------  ----------------------------  ------------------------
Miss (FN)           DCost(e)
False Alarm (FP)    RCost(e′) + PCost(e)          If DCost(e′) > RCost(e′)
                    0                             Otherwise
Hit (TP)            RCost(e) + ε₁·DCost(e)        If DCost(e) > RCost(e)
                    DCost(e)                      Otherwise
Normal (TN)         0
Misclassified Hit   RCost(e′) + ε₂·DCost(e)       If DCost(e′) > RCost(e′)
                    DCost(e)                      Otherwise
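Table 2 and the cumulative-cost formula translate directly into code. The following is a minimal Python sketch, not the report's implementation: it assumes DCost, RCost, and PCost for the true event e and the predicted event e′ have already been looked up as numbers, and that the caller supplies the fraction of DCost(e) already incurred (the ε term). The function names, parameter names, and example costs are hypothetical.

```python
def ccost(outcome, d_true=0.0, r_true=0.0, d_pred=0.0, r_pred=0.0,
          penalty=0.0, eps=0.0):
    """Consequential cost CCost(e) following Table 2.

    d_true, r_true: DCost(e), RCost(e) for the true event e
    d_pred, r_pred: DCost(e'), RCost(e') for the predicted event e'
    penalty:        PCost(e), used only for false alarms
    eps:            fraction of DCost(e) already incurred, in [0, 1]
    """
    if outcome == "FN":              # miss: the attack succeeds
        return d_true
    if outcome == "TN":              # correct "normal" decision
        return 0.0
    if outcome == "FP":              # respond only if DCost(e') > RCost(e')
        return r_pred + penalty if d_pred > r_pred else 0.0
    if outcome == "TP":              # respond only if DCost(e) > RCost(e)
        return r_true + eps * d_true if d_true > r_true else d_true
    if outcome == "MISCLASSIFIED":   # response targets e', damage comes from e
        return r_pred + eps * d_true if d_pred > r_pred else d_true
    raise ValueError(f"unknown outcome: {outcome}")


def cumulative_cost(events):
    """CumulativeCost(E): sum over events of OpCost(e) + CCost(e).

    Each event is an (op_cost, ccost_keyword_arguments) pair.
    """
    return sum(op + ccost(**kw) for op, kw in events)


events = [
    (10, dict(outcome="TP", d_true=30, r_true=10, eps=0.5)),  # 10 + (10 + 15)
    (1, dict(outcome="FN", d_true=30)),                         # 1 + 30
    (1, dict(outcome="TN")),                                    # 1 + 0
]
print(cumulative_cost(events))  # 67.0
```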

4.4.3  Reducing  Operational  Cost 

In order to reduce OpCost, ID models need to use low-cost features as often as possible while still
maintaining a desired level of accuracy. Our approach is to build multiple ID models, each of
which uses a different set of features at a different cost level. Low-cost models are always
evaluated first by the IDS, and high-cost models are used only when the low-cost models cannot
make a prediction with sufficient accuracy. We implemented this multiple-model approach
using RIPPER [5], a rule induction algorithm.
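The evaluation order described above behaves like a cascade sorted by cost. This sketch does not use RIPPER rules; it assumes each model is a hypothetical callable returning a (label, confidence) pair, that each model's cost is incremental, and that a fixed confidence threshold decides when to escalate to the next, costlier model.

```python
from typing import Callable, Dict, List, Tuple

# A model maps a record to (predicted label, confidence in [0, 1]).
Model = Callable[[Dict[str, float]], Tuple[str, float]]


def cascade_predict(record, models: List[Tuple[float, Model]], threshold=0.9):
    """Evaluate models from cheapest to costliest.

    `models` holds (incremental op cost, model) pairs sorted by cost. The
    cascade stops at the first model whose confidence reaches `threshold`;
    otherwise it returns the last (most expensive) model's answer.
    Returns (label, total op cost spent).
    """
    if not models:
        raise ValueError("need at least one model")
    spent = 0.0
    for cost, model in models:
        spent += cost
        label, confidence = model(record)
        if confidence >= threshold:
            break
    return label, spent


def cheap_model(r):   # hypothetical level 1 model: looks only at the service
    return ("normal", 0.95) if r["service"] == 80 else ("attack", 0.6)


def costly_model(r):  # hypothetical level 4 model: uses a temporal count
    return ("attack", 0.99) if r["count"] > 100 else ("normal", 0.99)


models = [(1.0, cheap_model), (100.0, costly_model)]
print(cascade_predict({"service": 80, "count": 5}, models))    # ('normal', 1.0)
print(cascade_predict({"service": 23, "count": 500}, models))  # ('attack', 101.0)
```

The threshold trades accuracy against cost: raising it sends more records to the expensive model.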

4.4.4  Reducing  Consequential  Cost 

A  traditional  IDS  that  does  not  consider  the  trade-off  between  RCost  and  DCost  will 
attempt  to  respond  to  every  intrusion  that  it  detects.  As  a  result,  the  consequential  cost  for 
FP,  TP,  and  misclassified  hits  will  always  include  some  response  cost.  We  use  a  cost- 
sensitive  decision  module  to  determine  whether  response  should  ensue  based  on  whether 
DCost  is  greater  than  RCost.  The  decision  module  takes  as  input  an  intrusion  report 
generated  by  the  detection  module.  The  report  contains  the  name  of  the  predicted 
intrusion  and  the  name  of  the  target,  which  are  then  used  to  look  up  the  pre-determined 
DCost  and  RCost.  If  DCost  >  RCost,  the  decision  module  invokes  a  separate  module  to 
initiate  a  response;  otherwise,  it  simply  logs  the  intrusion  report. 
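The decision module's logic amounts to a table lookup followed by a comparison. In this Python sketch the cost tables, intrusion and host names, and the `respond` hook are hypothetical placeholders; the report does not say how the pre-determined costs are stored.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ids.decision")

# Hypothetical pre-determined costs, keyed by (intrusion name, target).
DCOST = {("buffer_overflow", "web01"): 100.0, ("port_scan", "web01"): 2.0}
RCOST = {("buffer_overflow", "web01"): 40.0, ("port_scan", "web01"): 20.0}


def decide(report, respond):
    """Respond only when DCost > RCost; otherwise just log the report.

    `report` is a dict with 'intrusion' and 'target' keys; `respond` is a
    callable that initiates a response. Returns True if a response was invoked.
    """
    key = (report["intrusion"], report["target"])
    if DCOST[key] > RCOST[key]:
        respond(report)
        return True
    log.info("logged without response: %s", report)
    return False


responses = []
print(decide({"intrusion": "buffer_overflow", "target": "web01"}, responses.append))  # True
print(decide({"intrusion": "port_scan", "target": "web01"}, responses.append))        # False
```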

4.4.5  Experimental  Results 

Our  experiments  used  data  that  was  distributed  by  the  1998  DARPA  Intrusion  Detection 
Evaluation  Program.  We  used  80%  of  the  data  for  training  the  detection  models.  The 
remaining  20%  were  used  as  a  test  set  for  evaluation  of  the  cost-sensitive  models. 

Our results showed that the multiple-model approach can achieve a 78% reduction in
operational cost, and that the consequential cost can be reduced by 90%.
4.5 Adaptive IDS


We advocate enabling an IDS to provide performance adaptation, that is, the best
possible performance for the given operating environment. It is extremely difficult, if not
impossible, for an IDS to be 100% accurate. The optimal performance of an IDS should
be determined not only by its ROC (Receiver Operating Characteristics) curve of
detection rate versus false alarm rate, but also by its cost metrics (e.g., damage cost of
intrusion) and the probability of intrusion. Accordingly, performance adaptation means
that an IDS should always maximize its cost-benefit for the given (current) operational
conditions. For example, if an IDS is forced to miss some intrusions that it could otherwise
detect using its "signature base" (e.g., due to stress or overload attacks), it should still
ensure that the best value (or minimum damage) is provided according to a cost analysis
of the circumstances. As a simple example, if we regard buffer-overflow as more damaging
than port-scan (and, for the sake of argument, all other factors, such as attack probability
and detection probability, are equal), then missing a port-scan is better than missing a
buffer-overflow. In this research, we developed a framework for considering the trade-offs
of IDS performance objectives. We developed techniques for run-time performance
measurement and monitoring, and for dynamic adaptation and reconfiguration of IDS
policies and mechanisms. We focused our work on misuse detection systems.

4.5.1  IDS  Performance  Metrics 

4.5.1.1 Expected Value

The purpose of a real-time IDS is to detect intrusions and prevent damage. Instead of
using  mere  statistical  accuracy,  we  should  evaluate  an  IDS  according  to  its  value  (or  cost- 
benefit).  For  each  attack  Ai,  an  IDS  equipped  with  the  detection  rule  Ri  (and  the  necessary 
preprocessing  and  logging  tasks)  for  Ai  provides  the  expected  value: 

$$V_i = C_i\, p_i\, (1 - \beta_i) - C'_i\, (1 - p_i)\, \alpha_i$$

where Cᵢ is the damage cost, pᵢ is the prior probability of the intrusion, βᵢ is the false
negative rate, C′ᵢ is the false alarm cost, and αᵢ is the false alarm rate. The first term is the
loss (damage) prevented because of true detection, and the second term is the loss incurred
because of false alarms. The total value of an IDS depends on its configuration, that is, its
collection of analysis tasks and hence the attacks that it "covers". It is simply the sum of
Vᵢ over the covered attacks.
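The expected value of a rule and the total value of a configuration are plain arithmetic. The class and field names below are hypothetical, and the example numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    damage_cost: float       # C_i, damage cost of attack A_i
    prior: float             # p_i, prior probability of A_i
    fn_rate: float           # beta_i, false negative rate of rule R_i
    false_alarm_cost: float  # C'_i
    fa_rate: float           # alpha_i, false alarm rate of rule R_i

    def value(self) -> float:
        """V_i = C_i p_i (1 - beta_i) - C'_i (1 - p_i) alpha_i."""
        prevented = self.damage_cost * self.prior * (1 - self.fn_rate)
        false_alarm_loss = self.false_alarm_cost * (1 - self.prior) * self.fa_rate
        return prevented - false_alarm_loss


def total_value(config):
    """Total value of an IDS configuration: the sum of V_i over covered attacks."""
    return sum(rule.value() for rule in config)


bof = RuleStats(damage_cost=100, prior=0.1, fn_rate=0.2,
                false_alarm_cost=10, fa_rate=0.01)
print(round(bof.value(), 3))  # 7.91
```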


4.5.1.2 Response Time

Upon arrival in the system, audit records are placed in a (common) queue (e.g., the
libpcap buffer). The queue has only one server, the audit data processing and intrusion
analysis unit. The processing and analysis tasks for each audit record are applied
sequentially. That is, each event goes through a sequence of analysis tasks. The process
terminates if a detection rule Rᵢ determines that the event is (part of) an intrusion, or
when all analysis is done and the event is deemed normal.

The expected system time of a newly arrived audit record includes the queuing time plus
the service time of the record. The queuing time is simply the sum of the service times of
the audit records that are already in the IDS. The service time of an audit record is the
sum of the processing times of the tasks that are applied to the record.

Obviously, if an IDS has a response time that is larger than the inter-arrival-time of audit
records, the queue can fill up and newly arrived records will be "dropped". As a
result, the IDS cannot reliably detect intrusions or may output more false alarms.
Therefore, it is important that an IDS operate under the constraint that its response time
is smaller than the inter-arrival-time of audit records.
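The timing model above can be written out for a single record. This sketch assumes the tasks run in a fixed order and that processing stops at the first task whose rule fires; it uses the full service time (all tasks applied) as the worst case when checking the constraint. Task times are hypothetical.

```python
def service_time(task_times, fired_index=None):
    """Service time of one record: sum of processing times of the tasks applied.

    task_times: processing times of the analysis tasks, in execution order.
    fired_index: index of the task whose detection rule flagged the record,
                 or None if the record passed all tasks and was deemed normal.
    """
    if fired_index is None:
        return sum(task_times)
    return sum(task_times[: fired_index + 1])


def system_time(queued_service_times, own_service_time):
    """Queuing time (service of records already in the IDS) plus own service time."""
    return sum(queued_service_times) + own_service_time


def keeps_up(worst_case_service_time, inter_arrival_time):
    """The constraint from the text: records must be handled faster than they arrive."""
    return worst_case_service_time < inter_arrival_time


tasks = [2.0, 3.0, 5.0]  # hypothetical per-task times, e.g. in microseconds
print(service_time(tasks), service_time(tasks, fired_index=0))  # 10.0 2.0
print(keeps_up(service_time(tasks), inter_arrival_time=8.0))    # False
```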

4.5.2  Performance  Optimization  and  Adaptation 

It is not always possible to run an IDS in its "full" configuration, i.e., with all analysis
tasks enabled. For example, if network traffic is high-volume and high-speed, the
inter-arrival-time of packets will be very small. If the IDS continues to run in its full
configuration, its response time is likely to exceed the inter-packet-arrival-time.

Our goal is then to configure an IDS to provide the best value while operating under the
above constraint. That is, if an IDS cannot accommodate all desirable analysis tasks
(without violating the constraint), it should include only the more valuable tasks (we also
assume that additional and orthogonal optimization techniques, such as rule-set ordering,
can be used). For example, an IDS should always detect "buffer-overflow" and only
analyze "slow scan" when time permits. More formally, we need to solve the following
performance optimization problem: select a set of analysis tasks for the IDS in such a
way that the total value of the IDS is maximized while it still operates under the
constraint that its response time is smaller than the inter-arrival-time of audit records.
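If one approximates the response-time constraint by requiring that the total per-record processing time of the selected tasks be strictly smaller than the inter-arrival-time, the selection problem becomes a 0/1 knapsack, solved exactly below by dynamic programming. That simplification, the integer time units, and the task values and times are assumptions of this sketch, not the report's formulation.

```python
def select_tasks(tasks, budget):
    """Pick analysis tasks maximizing total value under a time budget.

    tasks:  list of (name, value V_i, processing time in integer units).
    budget: inter-arrival-time in the same units; the selected tasks' total
            time must be strictly smaller than it.
    Returns (best total value, sorted list of selected task names).
    """
    capacity = budget - 1  # strict inequality: total time <= budget - 1
    if capacity < 0:
        return 0.0, []
    # best[c] = (value, names) achievable with total time at most c
    best = [(0.0, ())] * (capacity + 1)
    for name, value, time in tasks:
        # iterate capacities downward so each task is selected at most once
        for c in range(capacity, time - 1, -1):
            candidate = best[c - time][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - time][1] + (name,))
    value, names = best[capacity]
    return value, sorted(names)


tasks = [("buffer_overflow", 8.0, 4), ("slow_scan", 3.0, 6), ("port_scan", 1.5, 2)]
print(select_tasks(tasks, budget=8))  # (9.5, ['buffer_overflow', 'port_scan'])
```

With a budget of 8 units, the expensive "slow scan" task is dropped while "buffer-overflow" is kept, matching the example in the text.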

We note that the solution to the above optimization problem depends on the traffic and
attack conditions. This means that at run time, a pre-computed IDS configuration may not
provide the optimal value, because traffic and attack conditions can change. We define
performance adaptation as the process of dynamically reconfiguring an IDS to provide the
optimal value given the current run-time constraints.

Performance adaptation relies on run-time performance monitoring to detect the
conditions (e.g., "stress") that cause performance degradation and to measure the
parameter values needed for solving the optimization problem.
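One simple way to monitor for stress is to track a smoothed packet inter-arrival time and compare it against the current configuration's per-record processing time. The exponentially weighted moving average, the `weight` parameter, and the stress rule below are assumptions of this sketch; the report does not specify its monitoring method.

```python
class ArrivalMonitor:
    """Track packet inter-arrival time with an exponentially weighted moving average.

    Assumption for this sketch: stress is declared when the smoothed
    inter-arrival time falls to or below the current configuration's
    per-record processing time, signalling that the task selection
    should be recomputed.
    """

    def __init__(self, weight=0.2):
        self.weight = weight   # smoothing weight given to the newest gap
        self.last_ts = None
        self.avg_gap = None

    def observe(self, timestamp):
        """Record one arrival; return the smoothed inter-arrival time (or None)."""
        if self.last_ts is not None:
            gap = timestamp - self.last_ts
            if self.avg_gap is None:
                self.avg_gap = gap
            else:
                self.avg_gap = self.weight * gap + (1 - self.weight) * self.avg_gap
        self.last_ts = timestamp
        return self.avg_gap

    def stressed(self, config_time):
        """True when arrivals are, on average, no slower than per-record processing."""
        return self.avg_gap is not None and self.avg_gap <= config_time


mon = ArrivalMonitor(weight=0.5)
for ts in [0.0, 10.0, 20.0, 22.0, 24.0]:
    mon.observe(ts)
print(mon.avg_gap, mon.stressed(config_time=5.0))  # 4.0 True
```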

4.5.3  Performance  Optimization  and  Adaptation 

We experimented with two open-source IDSs, Bro and Snort. For both systems, we
showed that an attacker can purposely create stress conditions, by flooding the network
with traffic that requires a lot of processing, and then launch attacks that the IDSs will
miss because of packet drops.

We then modified both Bro and Snort. We added performance measurement and
monitoring code, and modules for solving the performance optimization problem. Our
experiments showed that the modified systems can dynamically change configurations
when stressed, in such a way that although some types of packets are not processed,
the more important attacks are still detected. That is, the modified systems have
performance adaptation abilities.

Please see [28] for more details of this work.

4.6 Alert Analysis and Attack Scenario Analysis

The individual alerts from IDSs alone may not be sufficient to detect or decipher
stealthy or sophisticated attack activities. A higher-level analysis is necessary.

The main focus of this research task was to develop analysis algorithms that can discover
new (or novel) relationships among alerts. Rather than relying on a priori alert
correlation knowledge, our algorithm uses a statistical causality analysis technique called
the Granger Causality Test (GCT) to correlate alerts and discover (new) relationships among
attack steps or anomaly activities.

The intuition is that attack steps that do not have well-known patterns or obvious
relationships may nonetheless have some statistical correlations in the alert data.

GCT uses statistical functions to test if lagged information on a time-series variable x
provides any statistically significant information about another time-series variable y. If
the answer is yes, we say variable x Granger-causes y.

We model variable y by the Autoregressive Model (AR Model) and the Autoregressive
Moving Average Model (ARMA Model). GCT compares the residuals of the AR Model with
the residuals of the ARMA Model. Specifically, for two time series variables y and x with
size N, the AR Model and ARMA Model of y are defined as:

$$\text{AR Model:}\quad y(k) = \sum_{i=1}^{p} \theta_i\, y(k-i) + e_0(k)$$

$$\text{ARMA Model:}\quad y(k) = \sum_{i=1}^{p} \alpha_i\, y(k-i) + \sum_{i=1}^{p} \beta_i\, x(k-i) + e_1(k)$$

where p is a particular lag length, and the parameters αᵢ, βᵢ, θᵢ (1 ≤ i ≤ p) are computed in
the process of solving the Ordinary Least Squares (OLS) problem. The residuals of the AR
Model and ARMA Model are

$$R_0 = \sum_{k=1}^{T} e_0^2(k) \quad\text{and}\quad R_1 = \sum_{k=1}^{T} e_1^2(k),$$

respectively, with T = N − p.

The Null Hypothesis H₀ of GCT is H₀: βᵢ = 0, i = 1, 2, ..., p. That is, x does not affect y up
to a delay of p time units. We denote g as the Granger Causality Index (GCI):

$$g = \frac{(R_0 - R_1)/p}{R_1/(T - 2p - 1)} \sim F(p,\, T - 2p - 1)$$


Here, F(a,b) is Fisher's F distribution with parameters a and b [29]. An F-test is conducted to
verify the validity of the Null Hypothesis. If the value of g is larger than a threshold in
the F-test, then we reject the Null Hypothesis and conclude that x Granger-causes y. The
intuition of the GCI g is that it indicates how much better variable y can be predicted using
the histories of both variables x and y than using the history of y alone. We say that
variable x₁(k) is more likely to be causally related with y(k) than x₂(k) if g₁ > g₂ and both
have passed the F-test, where gᵢ, i = 1, 2, denotes the GCI for the input-output pair (xᵢ, y).
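The GCI computation can be sketched with NumPy least squares. This is an illustration, not the project's implementation: it follows the model definitions in the text (no intercept term), fits both models by OLS with `numpy.linalg.lstsq`, and leaves the F-test threshold to the caller. One option, if SciPy is available, is `scipy.stats.f.ppf(0.95, p, T - 2*p - 1)`. The synthetic data is made up.

```python
import numpy as np


def lagged(series, p):
    """Matrix whose row for time k is [s(k-1), ..., s(k-p)], for k = p .. N-1."""
    n = len(series)
    return np.column_stack([series[p - i : n - i] for i in range(1, p + 1)])


def granger_index(x, y, p):
    """Granger Causality Index g for the hypothesis 'x Granger-causes y'.

    Fits the AR model (y on its own p lags) and the model that adds x's
    p lags by OLS, without an intercept, then computes
    g = ((R0 - R1) / p) / (R1 / (T - 2p - 1)) with T = N - p.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[p:]                       # y(k) for the T usable samples
    T = len(target)
    ar_design = lagged(y, p)
    arma_design = np.hstack([ar_design, lagged(x, p)])

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    r0 = rss(ar_design)    # R_0: residual sum of squares of the AR model
    r1 = rss(arma_design)  # R_1: residual sum of squares with x's lags added
    return ((r0 - r1) / p) / (r1 / (T - 2 * p - 1))


# Synthetic check: y is driven by x one step earlier; x is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = np.zeros(500)
for k in range(1, 500):
    y[k] = 0.8 * x[k - 1] + 0.1 * rng.normal()
print(granger_index(x, y, p=2) > granger_index(y, x, p=2))  # True
```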


Applying GCT to alert correlation, the task is to analyze the (timestamped) alert streams
and determine which pairs of alerts have causal relationships. In a recent preliminary
study as part of our DARPA Cyber Panel Program project, we applied our algorithms to
the datasets of the DARPA Grand Challenge Problem (GCP). The GCP dataset includes
multiple stealth worm attack scenarios. Our alert correlation algorithms can correctly
discover both the obvious and the hidden patterns of causal relationships among attacks.
For example, in Scenario I, we can also discover the mutual causal relationships among the
worm's malicious activities of illegal file access (to install agent software and collect
sensitive data), uploading the stolen data to an external site, and downloading new agent
software. In Scenario II, we can discover the causality between the worm attack and the
server's abnormal service status.
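Before GCT can be applied, each alert stream has to become a time series. The report does not say how this was done; this sketch assumes one simple choice, counting alerts of each type per fixed-width time bin, and then lists the ordered pairs of alert types that would each be tested. Alert names, timestamps, and the bin width are hypothetical.

```python
from collections import defaultdict


def alert_time_series(alerts, bin_size, start, end):
    """Turn timestamped alerts into per-type count series.

    alerts: iterable of (timestamp, alert_type). Alerts outside [start, end)
    are ignored, and a trailing partial bin is dropped.
    Returns {alert_type: [count in bin 0, count in bin 1, ...]}.
    """
    n_bins = int((end - start) // bin_size)
    series = defaultdict(lambda: [0] * n_bins)
    for ts, kind in alerts:
        if start <= ts < end:
            idx = int((ts - start) // bin_size)
            if idx < n_bins:
                series[kind][idx] += 1
    return dict(series)


def candidate_pairs(series):
    """Ordered (x, y) pairs of distinct alert types, to test 'x Granger-causes y'."""
    kinds = sorted(series)
    return [(a, b) for a in kinds for b in kinds if a != b]


alerts = [(0.5, "scan"), (1.2, "scan"), (2.1, "exploit"), (3.7, "upload")]
ts = alert_time_series(alerts, bin_size=1.0, start=0.0, end=4.0)
print(ts["scan"], ts["exploit"])  # [1, 1, 0, 0] [0, 0, 1, 0]
print(len(candidate_pairs(ts)))   # 6
```

Real alert streams would need far more bins than this toy example before the F-test is meaningful, since T − 2p − 1 must be comfortably positive.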


More details can be found in [30].

5.  Conclusions 

In this project, we studied how to build cost-sensitive and light intrusion detection
models. Our goal was to automate as many of the analysis tasks in intrusion detection as
possible. The main research activities were in:

•  Automatic feature construction by analyzing the patterns of normal and intrusion
activities computed from large amounts of audit data

•  Light-weight  anomaly  detection  algorithms  using  patterns  of  packet  headers  and 
payloads 

•  Study  of  cost  factors  in  intrusion  detection.  Using  cost-sensitive  machine  learning 
algorithms  to  construct  intrusion  detection  models  that  achieve  optimal 
performance  on  the  given  cost  metrics 

•  Dynamic  (re-)configuration  to  make  IDS  more  effective  and  efficient,  and 
resilient  to  IDS-related  attacks 

•  Using  statistical  causality  analysis  to  discover  new  attack  step  relationships 

We have developed algorithms and prototype systems, and have conducted extensive
experiments using DARPA datasets and other real-world datasets. The results showed
that the technologies we developed in this project substantially advance and outperform
the state of the art.
The aims of the project were met. In fact, we went beyond the original proposal. The
results of this research have been reported in many publications. In addition, we have
actively engaged in technology transfer throughout the course of the project. In particular,
the PIs were involved in the founding of System Detection Inc.